
Convolutional Neural Networks with Intermediate Loss for 3D Super-Resolution of CT and MRI Scans

Mariana-Iuliana Georgescu, Radu Tudor Ionescu, Member, IEEE, and Nicolae Verga

Abstract—Computed Tomography (CT) scanners that are commonly used in hospitals and medical centers nowadays produce low-resolution images, up to 512 pixels in size. One pixel in the image corresponds to a one millimeter piece of tissue. In order to accurately segment tumors and make treatment plans, radiologists and oncologists need CT scans of higher resolution. The same problem appears in Magnetic Resonance Imaging (MRI). In this paper, we propose an approach for the single-image super-resolution of 3D CT or MRI scans. Our method is based on deep convolutional neural networks (CNNs) composed of 10 convolutional layers and an intermediate upscaling layer that is placed after the first 6 convolutional layers. Our first CNN, which increases the resolution on two axes (width and height), is followed by a second CNN, which increases the resolution on the third axis (depth). Different from other methods, we compute the loss with respect to the ground-truth high-resolution output right after the upscaling layer, in addition to computing the loss after the last convolutional layer. The intermediate loss forces our network to produce a better output, closer to the ground-truth. A widely-used approach to obtain sharp results is to add Gaussian blur using a fixed standard deviation. In order to avoid overfitting to a fixed standard deviation, we apply Gaussian smoothing with various standard deviations, unlike other approaches. We evaluate the proposed method in the context of 2D and 3D super-resolution of CT and MRI scans from two databases, comparing it to relevant related works from the literature and baselines based on various interpolation schemes, using 2× and 4× scaling factors. The empirical results show that our approach attains superior results to all other methods. Moreover, our human annotation study reveals that both doctors and regular annotators chose our method in favor of Lanczos interpolation in 97.55% of cases for an upscaling factor of 2× and in 96.69% of cases for an upscaling factor of 4×. In order to allow others to reproduce our state-of-the-art results, we provide our code as open source at https://github.com/lilygeorgescu/3d-super-res-cnn.

Index Terms—Convolutional neural networks, single-image super-resolution, CT images, MRI images, medical image super-resolution.

I. INTRODUCTION

MEDICAL centers and hospitals around the globe are typically equipped with single-energy Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) scanners that produce cross-sectional images (slices) of various body parts. The resulting images are of low resolution (typically around 256 pixels) and one pixel usually corresponds to a one millimeter piece of tissue.

M.I. Georgescu and R.T. Ionescu are with University of Bucharest, Romania, E-mail: [email protected].

N. Verga is with Colea Hospital and “Carol Davila” University of Medicine and Pharmacy, Romania.

Manuscript received December 19th, 2019; revised Month XX, 2020.

Fig. 1. Our method for 3D image super-resolution based on two subsequent fully-convolutional neural networks. In the first stage, the input volume is resized in two dimensions (width and height). In the second stage, the processed volume is further resized in the third dimension (depth). Using a scale factor of 2×, an input volume of 256 × 256 × 64 components is upsampled to 512 × 512 × 128 components (on all axes). Best viewed in color.

The thickness of one slice is also one millimeter, so the 3D CT images are composed of volumetric pixels (voxels) that correspond to one cubic millimeter (1 × 1 × 1 mm³) of tissue. One of the main benefits of this non-invasive scanning technique is that it allows doctors to see if there are malignant tumors inside the body. Nevertheless, doctors, and even machine learning systems [1], are not able to accurately contour (segment) the tumor regions because of the low resolution of CT or MRI scans. According to a team of radiologists from Colea Hospital in Bucharest, that provided a set of anonymized CT scans for our experiments, the desired resolution is to have one voxel correspond to a thousandth part of a cubic millimeter of tissue, i.e. a voxel of 0.1 × 0.1 × 0.1 mm³. In other words, the goal is to increase the resolution of 3D CT and MRI scans by a factor of 10× in each direction.

The main motivation behind our work is to allow radiologists and oncologists to accurately segment tumors and make better treatment plans. In order to achieve the desired goal, we propose a machine learning method that takes as input a 3D image and increases the resolution of the input image by a factor of 2× or 4×, providing as output a high-resolution 3D image. To our knowledge, there are only a few previous works [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18] that study the super-resolution of CT or MRI images. Similar to some of these previous works [2], [3], [4], [5], [6], [7], [9], [11], [12], [13], [14], [15], [16], [17], [18], we approach single-image super-resolution of CT and MRI scans using deep convolutional neural networks (CNNs).



CNN models rely on end-to-end training from large amounts of data in order to achieve state-of-the-art results. We propose a CNN architecture composed of 10 convolutional layers and an intermediate sub-pixel convolutional (upscaling) layer [19] that is placed after the first 6 convolutional layers. Different from related works [3], [6], [16], [18] that use the sub-pixel convolutional layer of Shi et al. [19], we add 4 convolutional layers after the upscaling layer. In order to obtain 3D super-resolution, we employ two CNNs with similar architectures, as illustrated in Figure 1. The first CNN increases the resolution on two axes (width and height), while the second CNN takes the output from the first CNN and further increases the resolution on the third axis (depth), thus increasing the resolution on all three axes. Different from related methods [3], [6], [16], [18], we compute the loss with respect to the ground-truth high-resolution output right after the upscaling layer, in addition to computing the loss after the last convolutional layer. The intermediate loss forces our network to produce a better output, closer to the ground-truth. In order to improve the results and obtain sharper images, a common approach is to apply Gaussian smoothing on top of the input images, using a fixed standard deviation. Different from other medical image super-resolution methods [3], [14], [17], we use various standard deviations in order to avoid overfitting to a certain standard deviation and improve the generalization capacity of our model.

We conduct super-resolution experiments on two databases of 3D CT and MRI images, one gathered from the Colea Hospital (CH) and one that is publicly available online, known as NAMIC¹. We compare our method with several interpolation baselines (nearest, bilinear, bicubic, Lanczos) and state-of-the-art methods [3], [13], [17], in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). We perform comparative experiments on both 2D and 3D single-image super-resolution under commonly-used upscaling factors, namely 2× and 4×. The empirical results indicate that our approach is able to surpass all the other methods included in the experiments. For example, on the NAMIC data set, we obtain a PSNR of 40.57 and an SSIM of 0.9835 for 3D super-resolution by a factor of 2×, while Pham et al. [13] reported a PSNR of 38.28 and an SSIM of 0.9781 in the same setting. Furthermore, we conduct a human evaluation study, asking 6 doctors and 12 regular annotators to choose between the CT images produced by our method and those produced by Lanczos interpolation (the best interpolation method). The annotators opted for our method in favor of Lanczos interpolation in 97.55% of cases for an upscaling factor of 2× and in 96.69% of cases for an upscaling factor of 4×. These results indicate that our method is significantly better than Lanczos interpolation. To our knowledge, we are the first to conduct a human evaluation study on the super-resolution of CT scans. In order to allow further developments and results replication, we provide our code as open source in a public repository².

¹Available at http://hdl.handle.net/1926/1687.
²Available at https://github.com/lilygeorgescu/3d-super-res-cnn.

To summarize, our contribution is threefold:
• We propose a novel CNN model for 3D super-resolution of CT and MRI scans, which is based on an intermediate loss added to the standard output loss and on smoothing the input using random standard deviations for the Gaussian blur.
• We conduct a human evaluation study to determine the quality and the utility of our super-resolution results.
• We provide our code online for download, allowing our results to be easily replicated.

We organize the rest of this paper as follows. We present related art in Section II. We describe our method in detail in Section III. We present experiments and results in Section IV. Finally, we draw our conclusions in Section V.

II. RELATED WORK

The purpose of single-image super-resolution (SISR) is to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart. Before the deep learning age, researchers have used exemplar-based or sparse coding methods for SISR. Exemplar-based methods learn mapping functions from external LR and HR exemplar pairs [20], [21], [22]. Sparse coding methods [23] are representative for external exemplar-based SR methods. For example, the method of Yang et al. [23] builds a dictionary with LR patches and the corresponding HR patches.

To our knowledge, the first work to present a deep learning approach for SISR is [24]. Dong et al. [24] proposed a CNN composed of 8 convolutional layers. The network was trained in an end-to-end fashion, minimizing the reconstruction error between the HR image and the output of the network. They used bicubic interpolation to resize the image, before giving it as input to the network. Hence, the CNN takes a blurred HR image as input and learns how to make it sharper. Since the input is a HR image, this type of CNN is time consuming. Therefore, Shi et al. [19] introduced a new method for upsampling the image using the CNN activation maps produced by the last layer. Their network is more efficient, because it builds the HR image only at the very end. Other works, such as [25], proposed deeper architectures, focusing strictly on accuracy. Indeed, Zhang et al. [25] presented one of the deepest CNNs used for SR, composed of 400 layers. They used a channel attention mechanism and residual blocks to handle the depth of the network.

For medical SISR, some researchers have focused on sparse representations [8], [10], while others on training convolutional neural networks [1], [2], [3], [4], [5], [6], [7], [9], [11], [12], [14], [15], [16], [18].

The authors of [8] proposed a weakly-supervised joint convolutional sparse coding method to simultaneously solve the problems of super-resolution and cross-modal image synthesis. In [10], the authors adopted a method based on compressed sensing and self-similarity constraint, obtaining better results than [17] in terms of SSIM and PSNR.

Some works [1], [3], [5], [6], [9], [10], [11], [14], [15], [16], [18] focused on 2D upsampling, i.e. on increasing the width and height of CT/MRI slices, while other works [2], [4], [8], [12] focused on 3D upsampling, i.e. on increasing the resolution of full 3D CT/MRI scans on all three axes (width, height and depth).


For 2D upsampling, some works [1], [5], [9], [14] used interpolated low-resolution (ILR) images, while other works [3], [6], [16], [18] used the efficient sub-pixel convolutional neural network (ESPCN) introduced in [19]. Similar to the latter approaches [3], [6], [16], [18], we employ the sub-pixel convolutional layer of Shi et al. [19]. Different from these related works [3], [6], [16], [18], we add a convolutional block after the sub-pixel convolutional layer, in order to enhance the HR output image. Furthermore, we propose a novel loss function for our CNN model. Instead of computing the loss only between the output image and the ground-truth high-resolution image, we also compute the loss between the intermediate image given by the sub-pixel convolutional layer and the high-resolution image. This forces our neural network to learn a better intermediate representation, increasing its performance.

There are some works [2], [11], [15] that employed generative adversarial networks (GANs) [26] to upsample medical images. Our approach based on fully-convolutional neural networks is less related to GAN-based SISR methods.

For 3D upsampling, Chen et al. [2] trained a CNN with 3D convolutions and used a GAN-based loss function to produce sharper and more realistic images. In order to upsample a 3D image, Du et al. [4] employed a deconvolutional layer composed of 3D filters to upsample the LR image, in an attempt to reduce the computational complexity. As [2], [4], [8], [12], we tackle the problem of 3D CT/MRI image super-resolution. However, instead of using inefficient 3D filters to upsample the LR images in a single step, we propose a two-stage approach that uses efficient 2D filters. Our approach employs a CNN to increase the resolution in width and height, and another CNN to further increase the resolution depth-wise.

Most SISR works [3], [14], [17] apply Gaussian smoothing using a fixed standard deviation on the training images, thus training the models in more difficult conditions. However, we believe that using a fixed standard deviation can harm the performance, as deep models tend to overfit to the training data. Different from the standard methodology, each time we apply smoothing on a training image, we randomly choose a different standard deviation. This simple change improves the generalization capacity of our model, yielding better performance at test time.

While many works focus only on the super-resolution task, the work of Sert et al. [1] is focused on the gain brought by the upsampled images in solving a different task. Indeed, the authors [1] obtained an improvement of 7.5% in the classification of segmented brain tumors when the upsampled images were used.

We note that there is also some effort in designing and obtaining CT scan results of higher resolution directly from CT scanners. For example, X-ray microtomography (micro-CT) [27], which is based on pixel sizes of the cross-sections in the micrometer range, has applications in medical imaging [28], [29], but such micro-CT scanners are far more expensive than standard CT scanners. Hence, the vast majority of hospitals rely on standard CT scanners which can provide a macroscopic image only.

Another alternative to standard (single-energy) CT is dual-energy or multi-energy CT [30]. In dual-energy CT, an additional measurement is obtained with a second X-ray spectrum, allowing the differentiation of multiple materials that cannot be distinguished in single-energy CT. Although several studies present the benefits of dual-energy CT [30], [31], it has remained underutilized over the past decade, probably due to the novelty of the medical methodology and higher costs compared to single-energy CT scanners.

Different from micro-CT or dual-energy CT, our focus is to increase the resolution of single-energy CT images using an algorithm based on machine learning.

III. METHOD

Our approach for solving the 3D image super-resolution problem is divided into two stages, as illustrated in Figure 1. In the first stage, we upsample the image on height and width using a deep fully-convolutional neural network. Then, in the second stage, we further upsample the resulting image on the depth axis using another fully-convolutional neural network. Therefore, our complete method is designed for resizing the 3D input volume on all three axes. While the CNN used in the first stage resizes the image on two axes, the CNN used in the second stage resizes the image on a single axis. Both CNNs share the same architecture, the only difference being in the upsampling layer (the second CNN upsamples in only one direction). At training time, our CNNs operate on patches. However, since the architecture is fully-convolutional, the models can operate on entire slices at inference time.
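To make the two-stage pipeline concrete, the sketch below (a minimal Python/NumPy illustration, not the authors' released code) applies a placeholder height-and-width upscaling function to every axial slice and then a placeholder depth upscaling function to every slice containing the depth axis; the two stand-in functions represent the two trained CNNs.

```python
import numpy as np

def upscale_3d(volume, upscale_hw, upscale_d, scale=2):
    """Two-stage 3D super-resolution: first on height/width, then on depth.

    `volume` has shape (depth, height, width); `upscale_hw` and `upscale_d`
    are placeholders standing in for the two trained fully-convolutional CNNs.
    """
    # Stage 1: upscale each axial slice on height and width.
    stage1 = np.stack([upscale_hw(slice_2d) for slice_2d in volume])
    # shape: (depth, height * scale, width * scale)

    # Stage 2: upscale on the depth axis by processing 2D slices that
    # contain the depth dimension (here, one slice per width coordinate).
    stage2 = np.stack(
        [upscale_d(stage1[:, :, x]) for x in range(stage1.shape[2])], axis=2)
    # shape: (depth * scale, height * scale, width * scale)
    return stage2

# Quick shape check with nearest-neighbor stand-ins for the two CNNs.
lr = np.random.rand(64, 256, 256).astype(np.float32)
hw = lambda s: s.repeat(2, axis=0).repeat(2, axis=1)   # placeholder for CNN 1
d = lambda s: s.repeat(2, axis=0)                      # placeholder for CNN 2
print(upscale_3d(lr, hw, d).shape)                     # (128, 512, 512)
```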

We further describe in detail the proposed CNN architecture, loss function and data augmentation procedure.

A. Architecture

Our architecture, depicted in Figure 2 and used for both CNNs, is composed of 10 convolutional layers (conv), each followed by Rectified Linear Unit (ReLU) [32] activations. All convolutional layers contain filters with a spatial support of 3 × 3. The 10 conv layers are divided in two blocks. The first block, formed of the first 6 conv layers, starts with the input of the neural network and ends just before the upscaling layer. Each of the first 5 convolutional layers is formed of 32 filters. For the CNN used in the first stage, the number of filters in the sixth convolutional layer is equal to the square of the scale factor, e.g. for a scale factor of 4× the number of filters is 16. For the CNN used in the second stage, the number of filters in the sixth convolutional layer is equal to the scale factor, e.g. for a scale factor of 4× the number of filters is 4. The difference is caused by the fact that the first CNN upscales on two axes, while the second CNN upscales on one axis. The first convolutional block contains a short-skip connection, from the first conv layer to the third conv layer, and a long-skip connection, from the first conv layer to the fifth conv layer.

Fig. 2. Our convolutional neural network for super-resolution on two axes, height and width. The network is composed of 10 convolutional layers and an upsampling (sub-pixel convolutional) layer. It takes as input low-resolution patches of 7 × 7 pixels and, for the r = 2 scale factor, it outputs high-resolution patches of 14 × 14 pixels. The convolutional layers are represented by green arrows. The sub-pixel convolutional layer is represented by the red arrow. The long-skip and short-skip connections are represented by blue arrows. Best viewed in color.

Fig. 3. An example of low-resolution input activation maps and the corresponding high-resolution output activation map given by the sub-pixel convolutional layer for upscaling on two axes. For a scaling factor of r = 2 in both directions, the sub-pixel convolutional layer requires r² = 4 activation maps as input. Best viewed in color.

Fig. 4. An example of low-resolution input activation maps and the corresponding high-resolution output activation map given by the sub-pixel convolutional layer for upscaling on one axis. For a scaling factor of r = 2 in one direction, the sub-pixel convolutional layer requires r = 2 activation maps as input. Best viewed in color.

The first convolutional block is followed by a sub-pixel convolutional (upscaling) layer, which was introduced in [19]. In the upscaling layer, the activation maps produced by the sixth conv layer are assembled into a single activation map. Throughout the first convolutional block, the spatial size of the low-resolution input is preserved, i.e. the activation maps of the sixth conv layer have h_I × w_I components, where h_I and w_I are the height and the width of the input image I. In order to increase the input r times on both axes, the output of the sixth conv layer must be a tensor of h_I × w_I × r² components. The activation map resulting from the sub-pixel convolutional layer is a matrix of (h_I · r) × (w_I · r) components. For super-resolution on two axes, the pixels are rearranged as shown in Figure 3. In a similar fashion, we can increase the input r times on one axis. In this case, the output of the sixth conv layer must be a tensor of h_I × w_I × r components. This time, the activation map resulting from the sub-pixel convolutional layer can be either a matrix of (h_I · r) × w_I components or a matrix of h_I × (w_I · r) components, depending on the direction we aim to increase the resolution. For super-resolution on one axis, the pixels are rearranged as shown in Figure 4. To our knowledge, we are the first to propose a sub-pixel convolutional (upscaling) layer for super-resolution in one direction.
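As an illustration of how the sub-pixel layer rearranges activation maps, the NumPy sketch below implements one possible ordering convention for the two-axis case and for the one-axis case; the exact interleaving used in the released implementation may differ.

```python
import numpy as np

def pixel_shuffle_2d(maps, r):
    """Rearrange r*r low-resolution maps of shape (h, w) into one (h*r, w*r) map."""
    rr, h, w = maps.shape
    assert rr == r * r
    # out[y*r + dy, x*r + dx] = maps[dy*r + dx, y, x]
    return maps.reshape(r, r, h, w).transpose(2, 0, 3, 1).reshape(h * r, w * r)

def pixel_shuffle_1d(maps, r):
    """Rearrange r low-resolution maps of shape (h, w) into one (h*r, w) map,
    i.e. upscaling r times along a single axis (the depth axis in our setting)."""
    rr, h, w = maps.shape
    assert rr == r
    # out[y*r + dy, x] = maps[dy, y, x]
    return maps.transpose(1, 0, 2).reshape(h * r, w)

# Quick shape checks for r = 2.
maps_2d = np.random.rand(4, 7, 7)   # r*r maps produced by the sixth conv layer
maps_1d = np.random.rand(2, 7, 7)   # r maps for the one-axis variant
print(pixel_shuffle_2d(maps_2d, 2).shape)  # (14, 14)
print(pixel_shuffle_1d(maps_1d, 2).shape)  # (14, 7)
```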

When Shi et al. [19] introduced the sub-pixel convolutional layer, they used it as the last layer of their CNN. Hence, the output depth of the upscaling layer is equal to the number of channels in the output image. Since we are working with CT/MRI (grayscale) images, the output has a single channel. Different from Shi et al. [19], we employ further convolutions after the upscaling layer. In our architecture, the upscaling layer is followed by our second convolutional block, which starts with the seventh convolutional layer and ends with the tenth convolutional layer. The first three conv layers in our second block are formed of 32 filters. The tenth conv layer contains a single convolutional filter, since our output is a grayscale image that has a single channel. The second convolutional block contains a short-skip connection, from the seventh conv layer to the ninth conv layer. The spatial size of h_O × w_O components of the activation maps is preserved throughout the second convolutional block, where h_O and w_O are the height and the width of the output image O.
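The PyTorch sketch below summarizes our reading of the first-stage (height and width) architecture; it is not the authors' released code, and the exact placement of the skip-connection additions and of the final activation is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRNet2D(nn.Module):
    """Sketch of the two-block architecture for upscaling height and width."""
    def __init__(self, scale=2, width=32):
        super().__init__()
        # First block: 6 conv layers, the last one outputs scale**2 maps.
        self.conv1 = nn.Conv2d(1, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1)
        self.conv3 = nn.Conv2d(width, width, 3, padding=1)
        self.conv4 = nn.Conv2d(width, width, 3, padding=1)
        self.conv5 = nn.Conv2d(width, width, 3, padding=1)
        self.conv6 = nn.Conv2d(width, scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)      # sub-pixel upscaling layer
        # Second block: 4 conv layers refining the upscaled image.
        self.conv7 = nn.Conv2d(1, width, 3, padding=1)
        self.conv8 = nn.Conv2d(width, width, 3, padding=1)
        self.conv9 = nn.Conv2d(width, width, 3, padding=1)
        self.conv10 = nn.Conv2d(width, 1, 3, padding=1)

    def forward(self, x):
        a1 = F.relu(self.conv1(x))
        a2 = F.relu(self.conv2(a1))
        a3 = F.relu(self.conv3(a2)) + a1           # short-skip connection (assumed addition)
        a4 = F.relu(self.conv4(a3))
        a5 = F.relu(self.conv5(a4)) + a1           # long-skip connection (assumed addition)
        a6 = F.relu(self.conv6(a5))
        intermediate = self.shuffle(a6)            # HR image right after upscaling
        b1 = F.relu(self.conv7(intermediate))
        b2 = F.relu(self.conv8(b1))
        b3 = F.relu(self.conv9(b2)) + b1           # short-skip connection (assumed addition)
        out = F.relu(self.conv10(b3))              # final HR image
        return intermediate, out

net = SRNet2D(scale=2)
lr_patch = torch.rand(1, 1, 7, 7)                  # a 7x7 low-resolution patch
intermediate, hr_patch = net(lr_patch)
print(intermediate.shape, hr_patch.shape)          # both torch.Size([1, 1, 14, 14])
```

Returning both the intermediate and the final HR image makes it straightforward to attach the intermediate loss described in the next subsection.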

B. Losses and Optimization

In order to obtain a CNN model for single-image super-resolution, the aim is to minimize the differences between the ground-truth high-resolution image and the output image provided by the CNN. Researchers typically employ the mean absolute difference as the objective function. Given a low-resolution input image I and the corresponding high-resolution output image O, the loss based on the mean absolute value is formally defined as follows:

L(θ, I, O) = Σ_{i=1}^{w_O} Σ_{j=1}^{h_O} |f(θ, I) − O|,    (1)

Page 5: Convolutional Neural Networks with Intermediate Loss for ... · [16], [17], [18], we approach single-image super-resolution of CT and MRI scans using deep convolutional neural networks

5

where θ are the CNN parameters (weights), f is the transformation function learned by the CNN, and w_O and h_O represent the width and the height of the output image O, respectively.

When we train our CNN model, we do not employ the standard approach of minimizing only the difference between the output provided by the CNN and the ground-truth HR image. Instead, we propose a novel approach based on an intermediate loss function. Since the conv layers after the upscaling layer are meant to refine the high-resolution image without taking any additional information from the low-resolution input image, we note that the high-resolution image obtained immediately after the upscaling layer should be as similar as possible to the ground-truth high-resolution image. Therefore, we propose a loss function that aims to minimize the difference between the intermediately-obtained HR image and the ground-truth HR image, in addition to minimizing the difference between the final HR image and the ground-truth HR image. Let f_1 denote the transformation function that corresponds to the first convolutional block and the upscaling layer, and let f_2 denote the transformation function that corresponds to the second convolutional block. With these notations, the transformation function f of our full CNN architecture can be written as follows:

f(θ, I) = f_2(θ_2, f_1(θ_1, I)),    (2)

where θ are the parameters of the full CNN, θ_1 are the parameters of the first convolutional block and θ_2 are the parameters of the second convolutional block, i.e. θ is a concatenation of θ_1 and θ_2. Having defined f_1 and f_2 as above, we can formally write our loss function as follows:

L_full = L_standard + λ · L_intermediate,    (3)

where λ is a parameter that controls the importance of the intermediate loss with respect to the standard loss, L_standard is the standard loss defined in Equation (1) and L_intermediate is defined as follows:

L_intermediate(θ_1, I, O) = Σ_{i=1}^{w_O} Σ_{j=1}^{h_O} |f_1(θ_1, I) − O|.    (4)

In the experiments, we set λ = 1, since we did not find any strong reason to assign a lower or higher importance to the intermediate loss. By replacing λ with 1 and the loss values from Equation (1) and Equation (4), Equation (3) becomes:

L_full(θ, I, O) = Σ_{i=1}^{w_O} Σ_{j=1}^{h_O} |f(θ, I) − O| + Σ_{i=1}^{w_O} Σ_{j=1}^{h_O} |f_1(θ_1, I) − O|.    (5)

In order to optimize towards the objective defined in Equation (5), we employ the Adam optimizer [33], which is known to converge faster than Stochastic Gradient Descent.
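A minimal sketch of the combined objective in Equation (5), written here with PyTorch tensors (an assumption, not the authors' code); both terms are pixel-wise absolute differences against the same ground-truth HR image, and using means instead of sums would only rescale the gradient.

```python
import torch

def full_loss(intermediate, output, target, lam=1.0):
    """L_full = L_standard + lambda * L_intermediate (lambda = 1 in the paper)."""
    l_standard = torch.abs(output - target).sum()             # Equation (1)
    l_intermediate = torch.abs(intermediate - target).sum()   # Equation (4)
    return l_standard + lam * l_intermediate

# Example with random tensors shaped like a batch of 14x14 HR patches.
mid = torch.rand(8, 1, 14, 14)
out = torch.rand(8, 1, 14, 14)
gt = torch.rand(8, 1, 14, 14)
print(full_loss(mid, out, gt).item())
```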

C. Data Augmentation

A common approach to force the CNN to produce sharper output images is to apply Gaussian smoothing using a fixed standard deviation at training time [3], [14], [17]. By training the CNN on blurred low-resolution images, it will have an easier task during inference, when the input images are no longer blurred. However, smoothing only the training images will inherently generate a distribution gap between training and testing data. If the CNN overfits to the training distribution, it might not produce the desired results at inference time. We propose to solve this problem by using a randomly-chosen standard deviation for each training image. Although the training data distribution will still be different from the testing data distribution, it will include the distribution of test samples, as illustrated in Figure 5. In order to augment the training data, we apply Gaussian blur with a probability of 0.5 (only half of the images are smoothed) using a kernel of 3 × 3 components and a randomly-chosen standard deviation between 0 and 0.5.

Fig. 5. Distribution of training samples (represented by green triangles) and test samples (represented by red circles), when the training samples are smoothed using a fixed standard deviation (left-hand side) versus using a randomly-chosen standard deviation (right-hand side). In example (a), overfitting on the training data leads to poor results on test data. In example (b), the danger of overfitting is diminished because the test distribution is included in the training distribution. Best viewed in color.
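A sketch of this augmentation step, assuming OpenCV's GaussianBlur with a 3 × 3 kernel; the probability, kernel size and sigma range follow the description above, while the function name and data layout are illustrative.

```python
import cv2
import numpy as np

def augment_lr_patch(lr_patch, blur_prob=0.5, max_sigma=0.5, rng=np.random):
    """With probability 0.5, smooth the LR training patch using a 3x3 Gaussian
    kernel whose standard deviation is drawn uniformly between 0 and 0.5."""
    if rng.rand() < blur_prob:
        sigma = rng.uniform(0.0, max_sigma)
        lr_patch = cv2.GaussianBlur(lr_patch, (3, 3), sigma)
    return lr_patch

patch = np.random.rand(7, 7).astype(np.float32)
print(augment_lr_patch(patch).shape)  # (7, 7)
```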

IV. EXPERIMENTS

A. Data Sets

The first data set used in the experiments consists of 10 anonymized 3D images of brain CT provided by the Medical School at Colea Hospital. We further refer to this data set as the Colea Hospital (CH) data set. In order to fairly train and test our CNN models and baselines, we randomly selected 6 images for training and used the remaining 4 images for testing. The training set has 359 slices (2D images) and the testing set has 238 slices. The height and the width of the slices vary between 192 and 512 pixels, while the depth of the 3D images varies between 3 and 176 slices. The resolution of a voxel is 1 × 1 × 1 mm³.

The second data set used in our experiments is the National Alliance for Medical Image Computing (NAMIC) Brain Multimodality data set. The NAMIC data set consists of 20 3D MRI images, each composed of 176 slices of 256 × 256 pixels. As for the CH data set, the resolution of a voxel is 1 × 1 × 1 mm³. For our experiments, we used T1-weighted (T1w) and T2-weighted (T2w) images independently. Following [13], we split the NAMIC data set into a training set containing 10 3D images and a test set containing the other 10 images. We kept the same split for T1w and T2w.

B. Experimental Setup

1) Evaluation Metrics: As most previous works [3], [5], [10], [12], [13], [14], [15], [16], [17], [18], we employ two evaluation metrics, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). The PSNR is the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Although the PSNR is one of the most used metrics for image reconstruction, some researchers [34], [35] argued that it is not highly indicative of the perceived similarity. The SSIM [34] aims to address this shortcoming by taking contrast, luminance and texture into account. The result of the SSIM is a number between -1 and 1, where a value of 1 means the ground-truth image and the reconstructed image are identical. Similarly, a higher PSNR value indicates a better reconstruction, although the PSNR does not have an upper bound.
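Both metrics are standard; for reference, the sketch below computes them with scikit-image on a single 2D slice (the PSNR is 10·log₁₀(MAX²/MSE)), under the assumption that intensities are already scaled to a known data range.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_slice(ground_truth, reconstruction, data_range=1.0):
    """Return (PSNR in dB, SSIM in [-1, 1]) for one 2D slice."""
    psnr = peak_signal_noise_ratio(ground_truth, reconstruction, data_range=data_range)
    ssim = structural_similarity(ground_truth, reconstruction, data_range=data_range)
    return psnr, ssim

gt = np.random.rand(256, 256)
rec = np.clip(gt + 0.01 * np.random.randn(256, 256), 0, 1)
print(evaluate_slice(gt, rec))
```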

2) Human Evaluation: Because the above metrics rely only on the pixel values and not on the perceived visual quality, we decided to evaluate our method using human annotators. Although a deep learning method can provide better PSNR and SSIM values, it might produce artifacts that could be misleading for a correct diagnosis and treatment. We thus have to make sure that our approach does not produce any unwanted artifacts visible to humans. We conducted the human evaluation study on the CH data set, testing our CNN-based method against Lanczos interpolation. We used CT slices extracted from high-resolution 3D images obtained after applying super-resolution on all three axes. For each upsampling factor, 2× and 4×, we extracted 100 CT slices at random from the test set. Hence, each human evaluator had to annotate 200 image pairs (100 for each upsampling factor). For each evaluation sample, an annotator would see the original image in the middle and the two reconstructed images on its sides, one on the left side and the other on the right side. The annotators had a magnifying glass tool that allowed them to look at details and discover artifacts. The locations (left or right) of the images reconstructed by our CNN and by Lanczos interpolation were randomly picked every time. To prevent any form of cheating, the randomly picked locations were unknown to the annotators. For each test sample, we asked each annotator to select the image that best reconstructed the original image. Our experiment was completed by 18 human annotators, 6 of them being doctors specialized in radiotherapy and oncology. In total, we collected 3600 annotations (18 annotators × 200 samples).

3) Baselines: We compare our method with standard resizing methods based on various interpolation schemes, namely nearest neighbors, bilinear, bicubic and Lanczos. In addition to these baselines, we compare with two methods [3], [17] that focused on 2D SISR and one method [13] that focused on 3D SISR.

C. Parameter Tuning and Preliminary Results

We conduct a series of preliminary experiments to determine the optimal patch size as well as the optimal width (number of convolutional filters) for our CNN. In order to find the optimal patch size, we tried out patches of 4 × 4, 7 × 7, 14 × 14 and 17 × 17 pixels. In terms of the number of filters, we tried out values in the set {32, 64, 128} for all conv layers in our network. These parameters were tuned in the context of 2D super-resolution on the CH data set. The corresponding results are presented in Table I. First of all, we note that our method produces better SSIM and PSNR values, i.e. 0.9270 and 36.22, for patches of 7 × 7 pixels. Second of all, we observe that adding more filters on the conv layers slightly increases the SSIM and PSNR values. However, the gains in terms of SSIM and PSNR come with a great cost in terms of time. For example, using 128 filters on each conv layer triples the processing time in comparison with using 32 filters on each conv layer. For the subsequent experiments, we thus opt for patches of 7 × 7 pixels and conv layers with 32 filters.

TABLE I: Preliminary 2D super-resolution results on the CH data set for an upscaling factor of 2×. The PSNR and the SSIM values are reported for various patch sizes and different numbers of filters. For models with 7 × 7 patches, we report the inference time (in seconds) per CT slice measured on an Nvidia GeForce 940MX GPU with 2GB of RAM.

Number of filters   Input size   SSIM     PSNR    Time (in seconds)
32                  4 × 4        0.9165   35.36   -
32                  7 × 7        0.9270   36.22   0.05
32                  14 × 14      0.8987   33.83   -
32                  17 × 17      0.8934   33.42   -
64                  7 × 7        0.9279   36.28   0.08
128                 7 × 7        0.9276   36.28   0.16

We believe that it is important to note that, although the number of training slices is typically in the range of a few hundreds, the number of training patches is typically in the range of hundreds of thousands. For instance, the number of 7 × 7 training patches extracted from the CH data set for the 2× upscaling factor is 326,000. We thus stress that the number of training samples is high enough to train highly-accurate deep learning models.

During training, we used mini-batches of 128 images throughout all the experiments. In a set of preliminary experiments, we did not observe any significant differences when using mini-batches of 64 or 256 images. In each experiment, we train the CNN for 40 epochs, starting with a learning rate of 10⁻³ and decreasing the learning rate to 10⁻⁴ after the first 20 epochs.
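The schedule above maps onto a few lines of standard training code; the sketch below uses PyTorch with a placeholder one-layer network and random patches (both hypothetical) purely to illustrate the optimizer, batch size and learning-rate drop.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed setup: mini-batches of 128 patches, 40 epochs of Adam,
# learning rate 1e-3 for the first 20 epochs and 1e-4 afterwards.
model = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.PixelShuffle(2))  # placeholder network
patches = TensorDataset(torch.rand(512, 1, 7, 7), torch.rand(512, 1, 14, 14))  # placeholder data
loader = DataLoader(patches, batch_size=128, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

for epoch in range(40):
    for lr_batch, hr_batch in loader:
        loss = torch.abs(model(lr_batch) - hr_batch).mean()  # L1 reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # drops the learning rate from 1e-3 to 1e-4 after epoch 20
```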

D. Ablation Study Results

We perform an ablation study to emphasize the effect of various components over the overall performance. The ablation results obtained on the CH data set for super-resolution on height and width by a factor of 2× are presented in Table II.

In our first ablation experiment, we have eliminated all the enhancements in order to show the performance level of a baseline CNN on the CH data set. Since there are several SISR works [3], [6], [16], [18] based on the standard ESPCN model [19], we have eliminated the second convolutional block in the second ablation experiment, transforming our architecture into a standard ESPCN architecture. The performance drops from 0.9270 to 0.9236 in terms of SSIM and from 36.22 to 35.94 in terms of PSNR. In the subsequent ablation experiments, we have removed, in turns, the intermediate loss, the short-skip connections and the long-skip connection.


TABLE II: Ablation 2D super-resolution results on the CH data set for an upscaling factor of 2×. The PSNR and the SSIM values are reported for various ablated versions of our CNN model. The best results are highlighted in bold.

Second conv block   Intermediate loss   Short-skip connections   Long-skip connection   Variable standard deviation   SSIM     PSNR
✗                   ✗                   ✗                        ✗                      ✗                             0.9224   35.58
✗                   ✗                   ✓                        ✓                      ✓                             0.9236   35.94
✓                   ✗                   ✓                        ✓                      ✓                             0.9256   36.15
✓                   ✓                   ✗                        ✓                      ✓                             0.9260   36.17
✓                   ✓                   ✓                        ✗                      ✓                             0.9234   36.11
✓                   ✓                   ✓                        ✓                      ✗                             0.9236   35.69
✓                   ✓                   ✓                        ✓                      ✓                             0.9270   36.22

TABLE III: 2D super-resolution results of our CNN model versus several interpolation baselines on the CH data set. The PSNR and the SSIM values are reported for two upscaling factors, 2× and 4×. The best result on each column is highlighted in bold.

Method              2× SSIM   2× PSNR   4× SSIM   4× PSNR
Nearest neighbor    0.8893    32.71     0.7659    29.06
Bilinear            0.8835    33.34     0.7725    29.73
Bicubic             0.9077    34.71     0.7965    30.41
Lanczos             0.9111    35.08     0.8012    30.57
Our CNN model       0.9291    36.39     0.8308    31.59

TABLE IV: 3D super-resolution results of our CNN model versus several interpolation baselines on the CH data set. The PSNR and the SSIM values are reported for two upscaling factors, 2× and 4×. The best result on each column is highlighted in bold.

Method              2× SSIM   2× PSNR   4× SSIM   4× PSNR
Nearest neighbor    0.8430    30.36     0.7152    27.32
Bilinear            0.8329    30.72     0.7206    27.93
Bicubic             0.8335    26.51     0.7200    24.05
Lanczos             0.8423    27.85     0.7263    25.06
Our CNN model       0.8926    33.04     0.7819    29.36

The results presented in Table II indicate that all these components are relevant to our model, bringing performance benefits in terms of both SSIM and PSNR. In our last ablation experiment, we used a fixed standard deviation instead of a variable one for the Gaussian blur added on training patches. We notice that our data augmentation approach based on a variable standard deviation brings the highest gains in terms of SSIM (from 0.9236 to 0.9270) and PSNR (from 35.69 to 36.22), with respect to the other ablated components. Overall, the ablation study indicates that all the proposed enhancements are useful.

E. Results on CH Data Set

We first compare our CNN-based model with a series of interpolation baselines on the CH data set. We present the results for super-resolution on two axes (width and height) in Table III. Among the considered baselines, it seems that the Lanczos interpolation method provides better results than the bicubic, the bilinear or the nearest neighbor methods. Our CNN model is able to surpass all baselines for both upscaling factors, 2× and 4×. Compared to the best interpolation method (Lanczos), our method is 0.0180 better in terms of SSIM and 1.31 better in terms of PSNR.

We note that, in Table II, we reported an SSIM of 0.9270 and a PSNR of 36.22 for our method, while in Table III, we reported an SSIM of 0.9291 and a PSNR of 36.39. The performance gains observed between Tables II and III are due to the self-ensemble strategy of Lim et al. [36], which we employ to boost the performance of our method. For each input image, the self-ensemble strategy consists in generating additional images using geometric transformations, e.g. rotations and flips. Following Lim et al. [36], we generate 7 augmented images from the LR input image, upsampling all 8 images (the original image and the 7 additional ones) using our CNN. We then apply the inverse transformations to the resulting 8 HR images in order to obtain 8 output images that are aligned with the ground-truth HR images. The final output image is obtained by taking the median of the HR images. In the following experiments on the CH and NAMIC data sets, the reported results always include the described self-ensemble strategy.
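A sketch of this geometric self-ensemble, assuming the 7 extra images are the usual rotations and horizontal flips as in Lim et al. [36]; `upscale_fn` is a placeholder for the trained fully-convolutional model.

```python
import numpy as np

def self_ensemble(lr_image, upscale_fn):
    """Upscale 8 transformed copies (4 rotations, with and without horizontal flip),
    undo each transform on the HR outputs and take the pixel-wise median."""
    outputs = []
    for flip in (False, True):
        img = np.fliplr(lr_image) if flip else lr_image
        for k in range(4):                        # rotations by 0, 90, 180, 270 degrees
            hr = upscale_fn(np.rot90(img, k))
            hr = np.rot90(hr, -k)                 # inverse rotation
            if flip:
                hr = np.fliplr(hr)                # inverse flip
            outputs.append(hr)
    return np.median(np.stack(outputs), axis=0)

lr = np.random.rand(64, 64)
hr = self_ensemble(lr, lambda s: s.repeat(2, axis=0).repeat(2, axis=1))
print(hr.shape)  # (128, 128)
```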

We provide the results for super-resolution on all three axes in Table IV. First of all, we notice that the SSIM and the PSNR values are lower for all methods when dealing with 3D super-resolution (Table IV) instead of 2D super-resolution (Table III). This shows that the task of 3D super-resolution is much harder than 2D super-resolution, as expected. Nevertheless, our method exhibits smaller performance drops when going from 2D super-resolution to 3D super-resolution. As for the 2D super-resolution experiments on the CH data set, our CNN model for 3D super-resolution is superior to all baselines for both upscaling factors. We thus conclude that our CNN model is better than all interpolation baselines on the CH data set, for both 2D and 3D super-resolution and for all upscaling factors.

F. Results on NAMIC Data Set

On the NAMIC data set, we compare our method with the best-performing interpolation method on the CH data set, namely Lanczos interpolation, as well as the state-of-the-art methods [3], [13], [17] reporting results on NAMIC. We present the corresponding 2D and 3D super-resolution results in Table V.

We note that most previous works, including [3], [13], [17], used bicubic interpolation as a relevant baseline. Unlike these works, we opted for Lanczos interpolation, which provided better results than bicubic interpolation and other interpolation methods on the CH data set. The results presented in Table V indicate that not all state-of-the-art methods attain better performance than Lanczos interpolation. This proves that Lanczos interpolation is a much stronger baseline.

TABLE V: 2D and 3D super-resolution results of our CNN model versus several state-of-the-art methods [3], [13], [17] and the Lanczos interpolation baseline on the NAMIC data set. For Zeng et al. [17], we included results for both single-channel super-resolution (SCSR) and multi-channel super-resolution (MCSR). The PSNR and the SSIM values are reported for both T1w and T2w images and for two upscaling factors, 2× and 4×. The best results on each column are highlighted in bold.

2D super-resolution:
Method                     T1w 2×           T1w 4×           T2w 2×           T2w 4×
                           SSIM    PSNR     SSIM    PSNR     SSIM    PSNR     SSIM    PSNR
Lanczos                    0.9635  37.03    0.8955  32.71    0.9788  39.64    0.9143  33.80
Zeng et al. [17] (SCSR)    0.9220  36.86    0.7120  28.33    −       −        −       −
Zeng et al. [17] (MCSR)    0.9450  38.32    0.8110  30.84    −       −        −       −
Du et al. [3]              0.9390  37.21    0.7370  29.05    −       −        −       −
Pham et al. [13]           −       −        −       −        −       −        −       −
Our CNN model              0.9775  39.29    0.9193  33.74    0.9882  42.20    0.9382  34.86

3D super-resolution:
Method                     T1w 2×           T1w 4×           T2w 2×           T2w 4×
                           SSIM    PSNR     SSIM    PSNR     SSIM    PSNR     SSIM    PSNR
Lanczos                    0.9423  35.72    0.8690  31.81    0.9615  37.80    0.8829  32.08
Zeng et al. [17] (SCSR)    −       −        −       −        −       −        −       −
Zeng et al. [17] (MCSR)    −       −        −       −        −       −        −       −
Du et al. [3]              −       −        −       −        −       −        −       −
Pham et al. [13]           −       −        −       −        0.9781  38.28    −       −
Our CNN model              0.9687  37.85    0.9050  32.88    0.9835  40.57    0.9251  33.54

Fig. 6. Image super-resolution examples selected from the NAMIC data set. In order to obtain the input images of 128 × 128 pixels, the original NAMIC images were downsampled by a scale factor of 2×. HR images of 256 × 256 pixels generated by Lanczos interpolation and by our CNN model are compared with the original (ground-truth) images.

While the state-of-the-art methods [3], [13], [17] presented results only for some cases, either 2D super-resolution on T1w images or 3D super-resolution on T2w images, we provide our results for all possible cases on the NAMIC data set. We note that our CNN model surpasses Lanczos interpolation in each and every case. Furthermore, our model provides superior results to all the state-of-the-art methods [3], [13], [17] reporting results on the NAMIC data set.

In addition to the quantitative results shown in Table V, we present qualitative results in Figure 6. We selected 5 examples of 2D super-resolution results generated by Lanczos interpolation and by our CNN model. A close inspection reveals that our results are generally sharper than those of Lanczos interpolation. As also confirmed by the SSIM and PSNR values presented in Table V, the images generated by our CNN are closer to the ground-truth images. At the scale factor of 2× considered in Figure 6, our CNN does not produce any patterns or artifacts that deviate from the ground-truth.

G. Human Evaluation Results

We provide the outcome of the human evaluation study in Table VI. The study reveals that both doctors and regular annotators opted for our approach in favor of Lanczos interpolation at an overwhelming rate (97.55% at the 2× scale factor and 96.69% at the 4× scale factor). For the 2× scale factor, 10 out of 18 annotators preferred the output of our CNN in all the 100 presented cases. We note that doctors #2 and #4 opted for Lanczos interpolation in 15 and 14 cases (for the 2× scale factor), respectively, which was not typical to the other annotators. Similarly, for the 4× scale factor, there are 3 annotators (doctor #4, doctor #5 and person #6) that seem to prefer Lanczos interpolation at a higher rate than the other annotators. After discussing with the doctors about their choices, we discovered that, in most cases, they prefer the sharper output of our CNN. However, the CNN seems to also introduce some reconstruction patterns (learned from training data) that do not correspond to the ground-truth. This phenomenon seems to be more prevalent at the 4× scale factor. This explains why doctors #4 and #5 preferred Lanczos interpolation in more cases than the other doctors. They considered that it is safer to opt for Lanczos interpolation, although the output is blurred and less informative.

TABLE VI: Human evaluation results collected from 6 doctors and 12 regular annotators, for the comparison between our CNN-based method versus Lanczos interpolation. For each upscaling factor, each annotator had to select an option for a number of 100 image pairs. To prevent cheating, the randomly picked locations (left or right) for the generated HR images were unknown to the annotators.

Annotator ID    2× Our CNN   2× Lanczos   4× Our CNN   4× Lanczos
Doctor #1       100          0            100          0
Doctor #2       85           15           100          0
Doctor #3       100          0            100          0
Doctor #4       86           14           71           29
Doctor #5       100          0            68           32
Doctor #6       95           5            95           5
Person #1       100          0            99           1
Person #2       97           3            100          0
Person #3       98           2            100          0
Person #4       98           2            100          0
Person #5       100          0            98           2
Person #6       100          0            76           24
Person #7       100          0            93           7
Person #8       99           1            98           2
Person #9       100          0            98           2
Person #10      100          0            98           2
Person #11      98           2            99           1
Person #12      100          0            98           2
Overall in %    97.55%       2.45%        96.69%       3.31%

Based on our human evaluation study, we concluded with the doctors that going beyond the 4× scale factor, solely with a method based on algorithmic super-resolution, is neither safe (a CNN might introduce too many patterns unrelated to the ground-truth) nor helpful (a standard interpolation method is not informative). Therefore, in order to reach the scale factor of 10× desired by the doctors, we have to look in other directions in future work. A promising direction is to combine multiple inputs, e.g. by using CT and MRI scans of the same person or by using CT/MRI scans taken at different moments in time (before and after the contrast agent is introduced).

V. CONCLUSION

In this paper, we have presented an approach based on fully-convolutional neural networks for the super-resolution of CT/MRI scans. Our method is able to reliably upscale 3D CT/MRI images up to a scale factor of 4×. We have compared our approach with several interpolation baselines and state-of-the-art methods [3], [13], [17]. The empirical results indicated that our approach provides superior results on both the CH and NAMIC data sets. We have also conducted a human evaluation study, showing that our method is significantly better than Lanczos interpolation. The human evaluation study also revealed the limitations of a pure algorithmic approach. The doctors invited in our study concluded that going to a scale factor higher than 4× requires alternative solutions. In future work, we aim to continue our research by extending the proposed CNN method to multi-channel input. This will likely help us in achieving higher upscaling factors, e.g. 10×, required for the accurate diagnostics and treatment of cancer, an actively studied and extremely important research topic [37], [38].

REFERENCES

[1] E. Sert, F. Ozyurt, and A. Dogantekin, “A new approach for brain tumor diagnosis system: single image super resolution based maximum fuzzy entropy segmentation and convolutional neural network,” Medical Hypotheses, vol. 133, p. 109413, 2019.

[2] Y. Chen, F. Shi, A. G. Christodoulou, Y. Xie, Z. Zhou, and D. Li, “Efficient and accurate MRI super-resolution using a generative adversarial network and 3D multi-level densely connected network,” in Proceedings of MICCAI, 2018, pp. 91–99.

[3] X. Du and Y. He, “Gradient-Guided Convolutional Neural Network for MRI Image Super-Resolution,” Applied Sciences, vol. 9, no. 22, p. 4874, 2019.

[4] J. Du, L. Wang, A. Gholipour, Z. He, and Y. Jia, “Accelerated Super-resolution MR Image Reconstruction via a 3D Densely Connected Deep Convolutional Neural Network,” in Proceedings of BIBM, 2018, pp. 349–355.

[5] J. Du, Z. He, L. Wang, A. Gholipour, Z. Zhou, D. Chen, and Y. Jia, “Super-resolution reconstruction of single anisotropic 3D MR images using residual convolutional neural network,” Neurocomputing, 2019.

[6] J. Hatvani, A. Horvath, J. Michetti, A. Basarab, D. Kouame, and M. Gyongy, “Deep learning-based super-resolution applied to dental computed tomography,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 3, no. 2, pp. 120–128, 2018.

[7] J. Hatvani, A. Basarab, J.-Y. Tourneret, M. Gyongy, and D. Kouame, “A Tensor Factorization Method for 3-D Super Resolution With Application to Dental CT,” IEEE Transactions on Medical Imaging, vol. 38, no. 6, pp. 1524–1531, 2018.

[8] Y. Huang, L. Shao, and A. F. Frangi, “Simultaneous Super-Resolution and Cross-Modality Synthesis of 3D Medical Images Using Weakly-Supervised Joint Convolutional Sparse Coding,” in Proceedings of CVPR, 2017, pp. 5787–5796.

[9] J. Jurek, M. Kocinski, A. Materka, M. Elgalal, and A. Majos, “CNN-based superresolution reconstruction of 3D MR images using thick-slice scans,” Biocybernetics and Biomedical Engineering, vol. 40, no. 1, pp. 111–125, 2019.

[10] Y. Li, B. Song, J. Guo, X. Du, and M. Guizani, “Super-Resolution of Brain MRI Images Using Overcomplete Dictionaries and Nonlocal Similarity,” IEEE Access, vol. 7, pp. 25897–25907, 2019.

[11] D. Mahapatra, B. Bozorgtabar, and R. Garnavi, “Image super-resolution using progressive generative adversarial networks for medical image analysis,” Computerized Medical Imaging and Graphics, vol. 71, pp. 30–39, 2019.

[12] O. Oktay, W. Bai, M. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. Cook, D. O’Regan, and D. Rueckert, “Multi-input Cardiac Image Super-Resolution Using Convolutional Neural Networks,” in Proceedings of MICCAI, 2016, pp. 246–254.

[13] C.-H. Pham, C. Tor-Diez, H. Meunier, N. Bednarek, R. Fablet, N. Passat, and F. Rousseau, “Multiscale brain MRI super-resolution using deep 3D convolutional networks,” Computerized Medical Imaging and Graphics, vol. 77, no. 101647, 2019.

[14] J. Shi, Z. Li, S. Ying, C. Wang, Q. Liu, Q. Zhang, and P. Yan, “MR image super-resolution via wide residual networks with fixed skip connection,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 3, pp. 1129–1140, 2018.


[15] C. You, G. Li, Y. Zhang, X. Zhang, H. Shan, M. Li, S. Ju, Z. Zhao, Z. Zhang, W. Cong et al., “CT super-resolution GAN constrained by the identical, residual, and cycle learning ensemble (GAN-CIRCLE),” IEEE Transactions on Medical Imaging, 2019.

[16] H. Yu, D. Liu, H. Shi, H. Yu, Z. Wang, X. Wang, B. Cross, M. Bramler, and T. S. Huang, “Computed tomography super-resolution using convolutional neural networks,” in Proceedings of ICIP, 2017, pp. 3944–3948.

[17] K. Zeng, H. Zheng, C. Cai, Y. Yang, K. Zhang, and Z. Chen, “Simultaneous single- and multi-contrast super-resolution for brain MRI images based on a convolutional neural network,” Computers in Biology and Medicine, vol. 99, pp. 133–141, 2018.

[18] X. Zhao, Y. Zhang, T. Zhang, and X. Zou, “Channel splitting network for single MR image super-resolution,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5649–5662, 2019.

[19] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network,” in Proceedings of CVPR, 2016, pp. 1874–1883.

[20] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi Morel, “Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding,” in Proceedings of BMVC, 2012, pp. 135.1–135.10.

[21] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proceedings of CVPR, vol. 1, 2004, pp. I–I.

[22] D. Dai, R. Timofte, and L. V. Gool, “Jointly Optimized Regressors for Image Super-resolution,” Computer Graphics Forum, vol. 34, pp. 95–104, 2015.

[23] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image Super-Resolution Via Sparse Representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.

[24] C. Dong, C. C. Loy, K. He, and X. Tang, “Image Super-Resolution Using Deep Convolutional Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.

[25] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image Super-Resolution Using Very Deep Residual Channel Attention Networks,” in Proceedings of ECCV, 2018, pp. 294–310.

[26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of NIPS, 2014, pp. 2672–2680.

[27] J. C. Elliott and S. D. Dover, “X-ray microtomography,” Journal of Microscopy, vol. 126, no. 2, pp. 211–213, 1982.

[28] C. Enders, E.-M. Braig, K. Scherer, J. U. Werner, G. K. Lang, G. E. Lang, F. Pfeiffer, P. Nol, E. Rummeny, and J. Herzen, “Advanced Non-Destructive Ocular Visualization Methods by Improved X-Ray Imaging Techniques,” PLoS One, vol. 12, p. e0170633, 2017.

[29] G. Tromba, R. Longo, A. Abrami, F. Arfelli, A. Astolfo, P. Bregant, F. Brun, K. Casarin, V. Chenda, D. Dreossi, M. Hola, J. Kaiser, L. Mancini, R. H. Menk, E. Quai, E. Quaia, L. Rigon, T. Rokvic, N. Sodini, D. Sanabor, E. Schultke, M. Tonutti, A. Vascotto, F. Zanconati, M. Cova, and E. Castelli, “The SYRMEP Beamline of Elettra: Clinical Mammography and Biomedical Applications,” AIP Conference Proceedings, vol. 1266, no. 1, pp. 18–23, 2010.

[30] C. H. McCollough, S. Leng, L. Yu, and J. G. Fletcher, “Dual- and Multi-Energy CT: Principles, Technical Approaches, and Clinical Applications,” Radiology, vol. 276, no. 3, pp. 637–653, 2015.

[31] J. Doerner, M. Hauger, T. Hickethier, J. Byrtus, C. Wybranski, N. G. Hokamp, D. Maintz, and S. Haneder, “Image quality evaluation of dual-layer spectral detector CT of the chest and comparison with conventional CT imaging,” European Journal of Radiology, vol. 93, pp. 52–58, 2017.

[32] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in Proceedings of ICML, 2010, pp. 807–814.

[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR, 2015.

[34] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[35] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009.

[36] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced Deep Residual Networks for Single Image Super-Resolution,” in Proceedings of CVPRW, 2017, pp. 1132–1140.

[37] C. Popa, N. Verga, M. Patachia, S. Banita, and C. Matei, “Advantages of laser photoacoustic spectroscopy in radiotherapy characterization,” Romanian Reports in Physics, vol. 66, pp. 120–126, 2014.

[38] D. Sardari and N. Verga, “Calculation of externally applied electric field intensity for disruption of cancer cell proliferation,” Electromagnetic Biology and Medicine, vol. 29, no. 1–12, pp. 26–30, 2010.

Mariana-Iuliana Georgescu is a Ph.D. student at the Faculty of Mathematics and Computer Science, University of Bucharest. She received the B.Sc. degree from the Faculty of Mathematics and Computer Science, University of Bucharest, in 2017 and the M.Sc. degree in Artificial Intelligence from the same university in 2019. Although she is early in her research career, she is coauthor of a paper on abnormal event detection in video, accepted at CVPR 2019, and the first author of 3 papers published at conferences and journals. Her research interests include artificial intelligence, computer vision, machine learning, deep learning and medical image processing.

Radu Tudor Ionescu is Full Professor at the University of Bucharest, Romania. He graduated from the Faculty of Mathematics and Computer Science of the University of Bucharest in 2009, and he obtained a M.Sc. diploma in Artificial Intelligence as valedictorian from the same university. He completed his Ph.D. at the University of Bucharest in 2013. Radu received the 2014 Award for Outstanding Doctoral Research in the field of Computer Science from the Romanian Ad Astra Association. His research interests include machine learning, computer vision, image processing, text mining and computational biology. He published over 60 articles at international peer-reviewed conferences and journals, and a research monograph with Springer. Radu also received the “Young Researchers in Science and Engineering” Prize organized by prof. Rada Mihalcea for young Romanian researchers in all scientific fields. He participated at several international competitions obtaining top ranks: 4th place in the Facial Expression Recognition Challenge of the WREPL Workshop of ICML 2013, 3rd place in the Native Language Identification Shared Task of the BEA-8 Workshop of NAACL 2013, 2nd place in the Arabic Dialect Identification Shared Task of the VarDial Workshop of COLING 2016, 1st place in the Arabic Dialect Identification Shared Task of the VarDial Workshop of EACL 2017, 1st place in the Native Language Identification Shared Task of the BEA-12 Workshop of EMNLP 2017, and 1st place in the Arabic Dialect Identification Shared Task of the VarDial Workshop of COLING 2018.

Nicolae Verga is senior doctor in radiotherapy, senior doctor in radiology, head of the discipline of Oncological Radiotherapy and Medical Imagistics at the “Carol Davila” University of Medicine and Pharmacy of Bucharest. During his professional career he has published several articles and communications, in which he addresses the problem of different types of radiological image processing in order to improve the diagnosis, simulation and planning in the treatment of different malignant and benign tumors. In daily medical, didactic and research practice at Colea Hospital, Nicolae Verga evaluates the imagistic investigations of the patients and correlates them with the information provided by the anatomopathological, clinical and paraclinical examinations of the patients. The average number of cases is about 450 per year. Participation in the weekly activity of the Commission for Surgical Therapeutic Indications increases the number and diversity of cases.