

Research Article

A Novel Approach to Classifying Breast Cancer Histopathology Biopsy Images Using Bilateral Knowledge Distillation and Label Smoothing Regularization

Sushovan Chaudhury,1 Nilesh Shelke,2 Kartik Sau,1 B. Prasanalakshmi,3 and Mohammad Shabaz4

1University of Engineering and Management, Kolkata, India
2Priyadarshini Indira Gandhi College of Engineering, Nagpur, India
3Department of Computer Science, King Khalid University, Abha, Saudi Arabia
4Arba Minch University, Ethiopia

Correspondence should be addressed to B. Prasanalakshmi; [email protected]

Received 3 September 2021; Revised 1 October 2021; Accepted 1 October 2021; Published 20 October 2021

Academic Editor: Deepika Koundal

Copyright © 2021 Sushovan Chaudhury et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Breast cancer is the most common invasive cancer in women and the second main cause of cancer death in females, and tumours can be classified as benign or malignant. Research on breast cancer and its prevention has attracted increasing attention from researchers in recent years. At the same time, the development of data mining methods provides an effective way to extract useful information from complex databases, and prediction, classification, and clustering can be performed on the extracted information. The generic notion of knowledge distillation is that a network of higher capacity acts as a teacher and a network of lower capacity acts as a student. Different pipelines of knowledge distillation are known. However, previous work on knowledge distillation using label smoothing regularization reports experiments and results that break this general notion and show that knowledge distillation also works when a student model distils a teacher model, i.e., reverse knowledge distillation. It has also been shown that a poorly trained teacher model can train a student model to reach equivalent results. Building on the ideas from those works, we propose a novel bilateral knowledge distillation regime that enables multiple interactions between teacher and student models, i.e., teaching and distilling each other, eventually improving each other’s performance, and we evaluate our results on the BACH histopathology image dataset for breast cancer. The pretrained ResNeXt29 and MobileNetV2 models, already validated on the ImageNet dataset, are used for “transfer learning” on our dataset, and we obtain a final accuracy of more than 96% using this novel approach of bilateral KD.

1. Introduction

Breast cancer is the most lethal type of tumour in women worldwide, second only to lung cancer in terms of prevalence. In the most common type of breast cancer, invasive ductal carcinoma (IDC), the cancer develops in the milk ducts of the breast and then rapidly spreads to involve the surrounding structures. Invasive ductal carcinoma is so aggressive that it can even infiltrate the lymph nodes and the circulatory system, which leads to the cancer spreading to various parts of the body. Even though invasive or infiltrating ductal carcinoma may be rapidly progressive, it is still a treatable condition, at least when identified in its initial stages [1]. Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic, and optimization techniques that allow computers to “learn” from past examples and to detect hard-to-discern patterns in large, noisy, or complex datasets. As a result, machine learning is frequently used in cancer diagnosis and detection. Machine learning and deep learning approaches may help radiologists to a large extent by helping to detect breast cancer at an early stage by


identifying useful patterns in malignant cells. The learnt patterns can be used to classify unknown images of the breast as benign or malignant. Mammography, ultrasonography, and histological biopsies are some of the numerous options for breast cancer detection (BCD). Mammograms can help to reveal calcification in the human breast, whereas studying histology images remains a challenge due to the complexity of pattern identification. This study analyzes the gene activities of surviving vs. deceased patients for each therapy, and the potential biomarkers will help to identify the best therapy for patients based on their gene expression test. This model has very high accuracy levels, and it uses a hierarchical model as a tree that includes one-versus-rest classifications. In 2012, there were roughly 1.7 million new cancer cases (representing 25% of all malignancies in women) and 0.5 million cancer deaths (representing 15% of all cancer fatalities in women). In 140 nations, ductal carcinoma is the most prevalent cancer diagnosis among females, and it is the primary cause of cancer death in 101 nations. According to McGuire [2], histological study of breast cancers can also reveal the stage of the disease, or the extent to which the cancer has spread. This is measured in two ways: by the size of the tumour and by assessing the involvement of regional lymph nodes. Both of these variables provide valuable prognostic data for patient care. Smith et al. [3] recommend screening practices for the early recognition of breast cancer. Computer-aided diagnosis of ductal carcinoma can improve the performance of radiologists severalfold, as explained by Filipczuk et al. and Kowal et al. [4, 5]. The simple idea of the CAD system is to take raw images from different modalities, such as mammograms and histology images, and preprocess them. This preprocessing includes adjusting size, contrast, and brightness and performing augmentation in case input data are scarce. The next step involves segmentation, to identify the regions of interest in the image so that the system learns only the relevant part of the image and not the noise. The third stage is to extract features from a region of interest (ROI), which is usually done using a CNN, which handles automatic feature extraction by producing feature maps at each convolution layer and regularising the models with dropout so that they do not overfit [6]. The higher-order feature maps are flattened, and classification is performed using the class probabilities generated by a softmax layer. These models, however, are massive, containing millions (if not billions) of parameters, and so cannot be used on edge devices. The technique is tested on the BACH dataset, a large histology dataset for breast cancer detection, to reduce computational complexity. The concept is model compression: a larger and more sophisticated pretrained network (the teacher) teaches a smaller network (the student) step by step. Indeed, the theoretical search space for more complicated models is bigger than that of a smaller network [7, 8]. However, if we suppose that equivalent (or even comparable) convergence can be attained using a smaller network, then the convergence space of the teacher network should intersect with the solution space of the student network. Unfortunately, that by itself does not ensure that the solution spaces of the student network and the teacher network converge in the same region; the convergence of the student network may be completely different from the convergence of the teacher network [9].

The basic steps of knowledge distillation are given as follows; a minimal training-step sketch is given after the list.

(1) Train the teacher network: the complex teacher network is first trained using the whole dataset on high-performing GPUs

(2) Establish a link of correspondence: the student network is then designed, and a correspondence needs to be formed between each output of the simple student network and the complex teacher network

(3) Forward propagation through the teacher network: feed data forward through the teacher network to get all intermediate outputs, applying augmentation (if any) to the same

(4) Backpropagation through the student network: the output from the teacher and the correspondence relation are used to backpropagate the error to the student, so the student network learns to imitate the “knowledge” of the teacher
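To make steps (3) and (4) concrete, here is a minimal sketch of one distillation training step in PyTorch, assuming the usual soft-target loss of Hinton et al. [18]; the helper name kd_step and the default values of T and alpha are illustrative, not values from our experiments.

```python
import torch
import torch.nn.functional as F

def kd_step(student, teacher, x, labels, optimizer, T=4.0, alpha=0.9):
    """One knowledge-distillation step: the student fits the hard labels
    while imitating the teacher's temperature-softened outputs."""
    with torch.no_grad():                       # step (3): forward pass through the teacher
        teacher_logits = teacher(x)
    student_logits = student(x)
    # Soft-target loss: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable (Hinton et al. [18]).
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(teacher_logits / T, dim=1),
                         reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)   # hard-label loss
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()                             # step (4): backpropagate through the student
    optimizer.step()
    return loss.item()
```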

2. Related Work

Koboldt et al. [10] studied key ductal carcinoma by genomic DNA copy arrays, DNA methylation, exome sequencing, messenger RNA arrays, micro-RNA sequencing, and reverse-phase protein arrays. The different classifications of breast cancers are researched in accordance with WHO standards [11]. They outline the methods required to examine the histological description of breast cancer, similar to the one explained by Veta et al. [12]. With the introduction of whole slide imaging (WSI) scanners, which offer affordable and high-throughput histopathology slide digitalization, this research topic has become particularly relevant, with the goal of substituting the optical microscope as the principal tool used by pathologists. Breast cancer classification from histological photos is of considerable therapeutic value, according to [13], although it is yet to be fully studied. Multiclass classification of ductal carcinoma aims to find the subcategories of breast cancer (ductal carcinoma, fibroadenoma, lobular carcinoma, etc.). However, multiclass classification of breast cancer from histopathology pictures faces two major challenges: (1) sorting of binary classes (benign and malignant) and (2) minor variances between classes due to the wide diversity of high-resolution picture appearances, strong coherency of malignant cells, and large colour distribution inhomogeneity. Sharma et al. [14] gave special attention to histopathology images in the BACH dataset: a model was proposed in which the network emphasizes learning from patches, recording good accuracy in multilevel classification. Kausar et al. [15] experimented with a number of deep convolutional models, using colour normalisation as a normalisation strategy before applying augmentation to the input data. Praveena Anjelin and Ganesh Kumar [16] experimented with different transfer learning techniques


and pretrained models such as MobileNetV2, AlexNet, ResNet, and VGGNet on the BACH dataset. Several efforts in the field of knowledge distillation have also been completed. Ba and Caruana [17] experimented with the CIFAR10 dataset and found that a large network is capable of training smaller networks that are quite shallow but can match the accuracy of deep neural nets, and since they need not be deep, the computational complexity can be reduced. Hinton et al. [18] reported startling findings on MNIST, demonstrating that condensing the knowledge of an ensemble of models into a single model can greatly improve the acoustic model of a widely used commercial system [19]. Often the finest outcomes are attained by assembling millions of individual classifiers, which is computationally expensive. In [20], researchers offer a strategy for “compressing” big, complicated ensembles into smaller, quicker models with little to no performance loss. Dipalma et al. [21] utilised knowledge distillation with model compression to address challenging problems in whole slide images, such as histology. Knowledge distillation has been used in object detection by Dai et al. [22], face recognition by Ge et al. [23], semantic segmentation by He et al. [24], classification of chest X-ray abnormalities by Ho and Gwak [25], and retrieval of land surface temperature from MODIS daytime mid-infrared data by Tang and Wang [26]. Knowledge distillation was used to enhance the computational efficiency of breast cancer diagnosis by Garg et al. and Thiagarajan et al. [27, 28]. They stress that in a variety of applications, such as medical image analysis, weakly supervised instance labelling using simple image-level labels rather than costly fine-grained pixel annotations is critical. In contrast to traditional computer vision instance segmentation, the challenges addressed are characterised by a small number of training photos and nonlocal patterns that lead to diagnosis. Motivated by all of these KD applications, as well as the use of KD and computationally efficient neural nets for training on large WSIs such as the BACH images of the breast, we decided to focus on bilateral knowledge distillation, in which the teacher and student learn from each other and improve each other’s performance, as shown in [29–31].

3. Methodology

3.1. Dataset and Preprocessing. For our experiments, we used the ICIAR 2018 Grand Challenge on Breast Cancer Histology (BACH) dataset [32], illustrated in Figure 1. The dataset comprises haematoxylin and eosin (H&E) stained breast histology microscopy and whole slide images labelled as normal, benign, in situ carcinoma, or invasive carcinoma. The annotation was performed by two clinical specialists, and images with disagreement were discarded. A total of 400 microscopy images are in the collection, of which 100 each are normal images, benign images, in situ carcinoma, and invasive carcinoma. All microscopy images are in .tiff format, and their specifications are given below in Table 1.

The BACH dataset thus comprises 400 histological microscopy images of the breast stained with haematoxylin and eosin (H&E). According to the major cancer type in each image, these photos are categorized as [N] normal, [B] benign, [I] in situ carcinoma, or invasive carcinoma, with 100 labels each. The dataset was split into 70:30 train-test subgroups for the studies we ran. Because of variances in staining methods, scanner colour response, and tissue preparation, the staining process is complex and can result in undesired colour variations among tissue types. We employ a technique called structure-preserving colour normalisation with sparse stain separation, since colour variance across pictures can affect performance. The approach uses one picture as the target, and the other images are normalised by merging their separate stain density maps with the target image’s stain colour basis, maintaining the morphology. After that, the modified dataset is scaled between 0 and 1, and each channel is normalised to the ImageNet dataset statistics. The ImageNet project is a large visual database used in visual object recognition software research, with more than 14 million images of various types, both labelled and unlabelled. We use online data augmentation techniques to provide variety in the training dataset, allowing the network to generalise to new datasets. Because the histological pictures are rotationally invariant, horizontal and vertical flips with rotations ranging from -25 to 25 degrees were included. Random zoom and intensity shifts were also utilised as augmentation methods.
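The augmentation and normalisation just described could be sketched as follows with torchvision transforms; the zoom and intensity-shift ranges are illustrative assumptions, and the structure-preserving stain normalisation step, which needs a dedicated stain-separation implementation, is assumed to have been applied beforehand.

```python
from torchvision import transforms

# ImageNet channel statistics used for normalisation.
IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(25),                 # rotations in [-25, 25] degrees
    transforms.RandomAffine(0, scale=(0.9, 1.1)),  # random zoom (illustrative range)
    transforms.ColorJitter(brightness=0.1),        # intensity shift (illustrative range)
    transforms.ToTensor(),                         # scales pixel values to [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```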

3.2. Knowledge Distillation in Theory. Knowledge distillation refers to distilling facts from a complex teacher model into a relatively weak and light student model. However, it can be better summarized as a label smoothing regularization technique. The outcome of knowledge distillation depends not only on the similarity information between teacher categories but also on the regularization of soft targets. Label smoothing regularization refers to model training on a mixture of the one-hot label and a soft, smoothed label exchanged between the teacher and the student. The loss function of label smoothing regularization (LSR) for a network S is as follows. For every training instance x, S outputs the probability of each label as shown in the following equation:

$$r(a \mid x) = \mathrm{softmax}(T_a) = \frac{\exp(T_a)}{\sum_{f=1}^{A} \exp(T_f)}, \quad a \in \{1, \dots, A\}, \tag{1}$$

where S is the neural network to train and $T_f$ is the logit of the neural network. The smoothed label used by S is given in the following equation:

$$n'(a) = (1 - \beta)\, n(a) + \beta\, v(a), \tag{2}$$

where $n'(a)$, the smoothed label distribution, is a mixture of $n(a)$ and a fixed distribution $v(a)$ with weight $\beta$. Here, $v(a)$ is the uniform distribution $v(a) = 1/A$.


The smoothed entropy loss $B(n', r)$ is defined over the smoothed label as in the following equation:

$$B(n', r) = -\sum_{a=1}^{A} n'(a) \log r(a) = (1 - \beta)\, B(n, r) + \beta\, B(v, r) = (1 - \beta)\, B(n, r) + \beta \big( C_{AM}(v, r) + B(v) \big), \tag{3}$$

where $C_{AM}$ is the Kullback-Leibler divergence and $B(v)$ denotes the entropy of $v$, a constant for the fixed uniform distribution $v(a)$.

The loss function for label smoothing of neural network S can be written as in the following equations:

$$M_{MO} = (1 - \beta)\, B(n, r) + \beta\, C_{AM}(v, r), \tag{4}$$

$$m_{AC} = (1 - \beta)\, B(n, r) + \beta\, C_{AM}(r_x^{\pi}, r^{\pi}), \tag{5}$$

where $B(n, r)$ is the minimized cross-entropy and $C_{AM}$ is the Kullback-Leibler divergence. The $r_x^{\pi}(a)$ in $C_{AM}(r_x^{\pi}, r^{\pi})$ is a distribution from the teacher model.

Additionally, $C_{AM}(r_x^{\pi}, r^{\pi}) = B(r_x^{\pi}, r^{\pi}) - B(r_x^{\pi})$, where $B(r_x^{\pi})$ is a constant entropy. Therefore, Equation (5) can be reframed as the following equation:

$$m_{AC} = (1 - \beta)\, B(n, r) + \beta \big( B(r_x^{\pi}, r^{\pi}) - B(r_x^{\pi}) \big), \tag{6}$$

which, dropping the constant entropy term, reduces to $(1 - \beta)\, B(n, r) + \beta\, B(r_x^{\pi}, r^{\pi})$.

Setting the temperature $\pi = 1$ gives $m_{AC} = B(\tilde{n}_x, r)$, where $\tilde{n}_x$ is as follows:

$$\tilde{n}_x(a) = (1 - \beta)\, n(a) + \beta\, r_x(a). \tag{7}$$
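A direct reading of Equations (2), (3), and (7) gives the following minimal PyTorch sketch; lsr_loss and kd_as_lsr_loss are hypothetical helper names, and the beta defaults are illustrative.

```python
import torch.nn.functional as F

def lsr_loss(logits, target, beta=0.1):
    """Label smoothing regularization: mix the one-hot label n(a) with the
    uniform distribution v(a) = 1/A (Eq. (2)) and take the cross-entropy
    against the smoothed label (Eq. (3))."""
    log_probs = F.log_softmax(logits, dim=1)              # log r(a)
    A = logits.size(1)
    n = F.one_hot(target, A).float()                      # hard label n(a)
    n_smooth = (1 - beta) * n + beta / A                  # n'(a), Eq. (2)
    return -(n_smooth * log_probs).sum(dim=1).mean()      # B(n', r)

def kd_as_lsr_loss(student_logits, teacher_logits, target, beta=0.95):
    """KD at temperature pi = 1 viewed as LSR (Eq. (7)): the uniform prior
    is replaced by the teacher's output distribution r_x(a)."""
    log_probs = F.log_softmax(student_logits, dim=1)
    r_x = F.softmax(teacher_logits, dim=1).detach()       # teacher soft label
    n = F.one_hot(target, student_logits.size(1)).float()
    n_tilde = (1 - beta) * n + beta * r_x                 # Eq. (7)
    return -(n_tilde * log_probs).sum(dim=1).mean()
```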

3.3. Reverse Knowledge Distillation in Theory. Unlike normal KD, Re-KD takes the teacher’s accuracy as the baseline accuracy, which is much higher than in normal KD. In various cases, Re-KD exceeds normal KD. The basis of this concept is that the teacher becomes the student and the student becomes the teacher. Using Re-KD on the BACH dataset of cancer histopathology biopsy images, we found that Re-KD produces better results.

3.4. Bilateral Knowledge Distillation in Theory. For conventional knowledge distillation, the loss function is mathematically defined as follows:

$$M_{WAC} = \delta(j, j') + \sum_{h \in H} \eta_h\, \rho_h\big(E_h^S, E_h^X\big), \tag{8}$$

where $\delta(j, j')$ represents the task-specific loss, $j$ is the ground truth, $E_h^S$ is the transformed output of the student layer, and $E_h^X$ is the transformed output of the teacher network; note that this transformed module is produced from the $h$-th module. $\eta_h$ is a tunable balance factor which balances the unlike losses, and $\rho_h(\cdot, \cdot)$ is the $h$-th distillation loss, which reduces the difference between $E_h^S$ and $E_h^X$.

Consider $k_f \in \{0, 1\}$ to be a Bernoulli random variable, where $f$ indexes the $f$-th hybrid block in the hybrid network. We take $k_f = 1$ when the student path is selected and $k_f = 0$ when the teacher path is chosen. Suppose $B_f$ represents the input and $B_{f+1}$ the output of the $f$-th hybrid block; the output of the $f$-th hybrid block can then be represented as follows:

$$B_{f+1} = k_f\, I_f^S(B_f) + (1 - k_f)\, I_f^X(B_f). \tag{9}$$

Here, $I_f^S(\cdot)$ is the functional representation of the student block and $I_f^X(\cdot)$ is the functional representation of the teacher block.

Figure 1: The labelled images of each type of cell used in classification (normal, benign, in situ carcinoma, invasive carcinoma).

Table 1: Specifications of microscopy images.

S no.  Parameter       Details
1      Colour model    Red, green, blue
2      Size            2048 (W) × 1536 (L) pixels
3      Pixel scale     0.42 μm × 0.42 μm
4      Retention area  10-20 MB (approx.)


Taking the value of $k_f$ to be 1, Equation (9) takes the following form:

$$B_{f+1} = I_f^S(B_f). \tag{10}$$

This simply means the hybrid block reduces to the original student block. Whereas if we consider $k_f = 0$, Equation (9) can be formulated as follows:

$$B_{f+1} = I_f^X(B_f). \tag{11}$$

Now suppose $\delta(\cdot, \cdot)$ is the task-specific loss, $j$ represents the ground truth, $\tilde{j}_y$ denotes the predicted result of the hybrid network, and $d_y$ is the optimized loss of the hybrid network, which can be written as follows:

$$d_y = \delta(j, \tilde{j}_y). \tag{12}$$
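A minimal sketch of the hybrid block of Equations (9)-(11), assuming a PyTorch module with a Bernoulli gate; HybridBlock and p_student are illustrative names, and inference is assumed to follow the student path.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Eq. (9): a Bernoulli gate k_f selects the student block (k_f = 1,
    Eq. (10)) or the teacher block (k_f = 0, Eq. (11)) during training."""
    def __init__(self, student_block, teacher_block, p_student=0.5):
        super().__init__()
        self.student_block = student_block
        self.teacher_block = teacher_block
        self.p_student = p_student

    def forward(self, x):
        if self.training:
            k = torch.bernoulli(torch.tensor(self.p_student)).item()
        else:
            k = 1.0            # assume inference runs the student path only
        return self.student_block(x) if k == 1.0 else self.teacher_block(x)
```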

As per Figure 2 and Algorithm 1, we propose a bilateral knowledge distillation regime. In phase 1, the more complex model, ResNeXt29, is used as a teacher model to train the lightweight MobileNetV2. In step 2, the teacher and the student are reversed, interacting and exchanging knowledge. It has already been shown in the literature that teacher and student can distil knowledge to each other. In this novel approach, phase 2 then comes into the picture, where the learnings of the teacher and student are fed back to each other, improving accuracy at each phase. We stopped at phase 3, where knowledge distillation through MobileNetV2 produced state-of-the-art accuracy, as discussed in our results. This process can be iterated over several phases until the desired accuracy is reached, depending on the computational resources available. The idea is further illustrated in Figure 3 with a schematic description.

4. Results

4.1. Experimental Setup. For our experiments, we use two well-known transfer learning networks pretrained on the ImageNet dataset, ResNeXt29 and MobileNetV2. The BACH 2018 dataset of breast histopathology images was then used to train those networks. PyTorch (build 1.9.0) on the CUDA 10.2 platform was used for the experiments.
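As a minimal sketch of this setup, the snippet below loads an ImageNet-pretrained MobileNetV2 from torchvision and replaces its classifier head for the four BACH classes; ResNeXt29 (8 × 64d) is a CIFAR-style variant not shipped with torchvision and would come from a custom implementation, so only the MobileNetV2 side is shown.

```python
import torch.nn as nn
from torchvision import models

def build_mobilenet_student(num_classes=4):
    """Transfer-learning setup: ImageNet weights, new 4-class head."""
    net = models.mobilenet_v2(pretrained=True)           # ImageNet weights
    net.classifier[1] = nn.Linear(net.last_channel, num_classes)
    return net
```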

4.2. Baseline Teacher Training. Our baseline teacher training on ResNeXt29 (8 × 64d, with 34.53M parameters) and MobileNetV2 (comprising the initial full convolution layer with 32 filters, followed by 19 residual bottleneck layers) shows that the validation accuracy for ResNeXt29 is as high as 75 percent, whereas that of MobileNetV2 is 60.71 percent, as shown in Figures 4 and 5.

Figure 2: The proposed framework of knowledge distillation (ResNeXt29 baseline training, then phases 1-3 alternating KD and Re-KD between ResNeXt29 and MobileNetV2, 200 epochs per step).

Phase 1
  Step 1: Knowledge distil MobileNetV2 (teacher model: ResNeXt29); train e epochs
  Step 2: Reverse knowledge distil ResNeXt29 (teacher model: MobileNetV2); train e epochs
Phase 2
  Step 1: Knowledge distil MobileNetV2 (phase 1) (teacher model: ResNeXt29 (phase 1)); train e epochs
  Step 2: Reverse knowledge distil ResNeXt29 (teacher model: MobileNetV2); train e epochs
Phase 3
  Step 1: Knowledge distil MobileNetV2 (phase 2) (teacher model: ResNeXt29 (phase 2)); train e epochs

Algorithm 1: Bilateral knowledge distillation (base training: ResNeXt29 for e epochs).
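Algorithm 1 can be read as a simple alternating loop. The sketch below is one hypothetical rendering in PyTorch, reusing the kd_step helper sketched after the knowledge distillation steps in Section 1 and assuming the baseline training of ResNeXt29 has already been done; distil and bilateral_kd are illustrative names.

```python
import torch

def distil(student, teacher, loader, epochs, T=20.0, alpha=0.95):
    """One KD run: teacher frozen, student updated via kd_step (Section 1)."""
    opt = torch.optim.SGD(student.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            kd_step(student, teacher, x, y, opt, T=T, alpha=alpha)

def bilateral_kd(resnext, mobilenet, loader, epochs=200, phases=3):
    """Algorithm 1: the two networks alternately teach and distil each other."""
    for phase in range(phases):
        distil(mobilenet, resnext, loader, epochs)       # step 1: KD
        if phase < phases - 1:                           # the final phase stops after KD
            distil(resnext, mobilenet, loader, epochs)   # step 2: reverse KD
    return mobilenet
```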


Figure 3: Schematic flow diagram of bilateral KD as used in this paper (phases 1-3 with their KD and reverse-KD steps, as in Algorithm 1).

Figure 4: Train and test set accuracy of MobileNetV2 (panels: Train_accuracy, Train_loss, Test_accuracy, Test_loss over 200 epochs).



4.3. Tuning Hyperparameters. Several hyperparameters were tuned as follows: optimizer: SGD with momentum = 0.9 and weight_decay = 5e-4; learning rate scheduler: learning rate warm-up; number of phases n: 3; epochs per step e: 200; alpha: 0.95; temperature: 20; learning_rate: 0.1; batch_size: 8; dropout_rate: 0.5; save_summary_steps: 100; num_workers: 4.
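Wired up in PyTorch, these settings might look as follows; the warm-up length shown is an assumption, since the exact warm-up schedule is not specified, and the placeholder model stands in for either network.

```python
import torch
from torchvision import models

# The hyperparameters listed above, gathered in one place.
params = {"alpha": 0.95, "temperature": 20, "learning_rate": 0.1,
          "batch_size": 8, "dropout_rate": 0.5,
          "save_summary_steps": 100, "num_workers": 4}

model = models.mobilenet_v2(pretrained=True)   # placeholder for either network
optimizer = torch.optim.SGD(model.parameters(), lr=params["learning_rate"],
                            momentum=0.9, weight_decay=5e-4)
warmup_epochs = 5                              # assumed warm-up length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))
```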

4.4. Normal Knowledge Distillation Results. Since ResNeXt29 gave us higher validation accuracy, we proceed by taking it as the teacher network and MobileNetV2 as the lighter student network. As per the literature, we confirmed that the teacher model was able to distil the lightweight student model. However, since histological images are computationally complex, the results were no more than 60.71 percent on the validation set, and the loss was considerable. The normal KD results are shown in Figure 6.

4.5. Reverse Knowledge Distillation Results. We then reversed the teacher and student networks to validate reverse knowledge distillation on the given set of histological images. As such, MobileNetV2 now becomes the teacher and ResNeXt29 becomes the student. The results improved, as we obtained a validation accuracy of 65.58 percent. The results obtained through these experiments are shown in Figure 7.

4.6. Bilateral Knowledge Distillation Results. From the results of the above two experiments, we infer that neither normal KD nor reverse KD is sufficient to give the best result for computationally complex histological images. Our proposed method of bilateral knowledge distillation works on the notion that teacher and student interactively learn from each other and thereby improve each other’s accuracy. After three phases of interactive learning and exchanging information between teacher and student, MobileNetV2 reaches 96.3 percent accuracy on the validation set, which we claim as a state-of-the-art result for histological images of the breast. Results of training and validation using bilateral knowledge distillation are shown in Figure 8.

5. Discussion of Results

In bilateral knowledge distillation, we found that students and teachers interact with each other.

Figure 5: Train and test set accuracy of ResNeXt29 (panels: Train_accuracy, Train_loss, Test_accuracy, Test_loss over 200 epochs).


In noninteractive knowledge distillation, such as traditional knowledge distillation, the student network has only one goal, to mimic the teacher network, and communication from the student network is ignored. However, since the feature transformation capability of the student network is much lower than that of the teacher, it is impossible for the student to mimic exactly the knowledge that the teacher has transferred to it. Therefore, there is a difference between the knowledge taken in by the student and the knowledge transferred by the teacher, which limits KD performance. In interactive, bilateral knowledge distillation, the student takes action based on the problem it is facing, then the teacher gives feedback based on the student’s actions, and finally, the student acts according to the teacher’s feedback. In particular, we propose that the teacher and student be interactive and that knowledge be transferred from teacher to student and from student to teacher until the student that assesses the classes reaches the allowable accuracy. We used two pretrained models trained on the ImageNet dataset for our baseline training, namely, ResNeXt29 and MobileNetV2, to validate the theories of knowledge distillation on breast histological images. The histology biopsy images are complex and have a similar colour combination when viewed under a microscope after staining. As such, even with robust networks like ResNeXt29 and MobileNetV2, the baseline accuracies were limited. In normal KD, due to its higher validation accuracy, ResNeXt29, the more complex model, was chosen as the teacher network and the simple MobileNetV2 as the corresponding student network, and the teacher and student networks were reversed for Re-KD. After three phases of interactive learning and exchanging information between teacher and student, MobileNetV2 reaches 96.3 percent accuracy on the validation set, which we claim as a state-of-the-art result for histological images of the breast. The limited accuracy obtained through the conventional methods prompted us to run the experiment with bilateral-interactive KD. As knowledge distillation is in essence a regularization technique, the student network is seldom overfitted and can backpropagate the errors. Therefore, there is constant feedforward and backpropagation across the different phases of the bilateral KD model. In each phase, the labels are smoothed with regularization, and hence the accuracy improves with each phase. All the results we discussed are summarized in Table 2 below. We also conclude that with the increase of temperature, the distribution of soft targets in the teacher network mimics the uniform distribution of label smoothing. The other important observation in this experiment is that bilateral KD does not need additional distillation losses and is in accordance with the conventional KD method.

Figure 6: Normal knowledge distillation results (Train_accuracy and Test_accuracy over 180 epochs).

Figure 7: Reverse knowledge distillation results (Train_accuracy and Test_accuracy over 200 epochs).


6. Conclusion

In this paper, we have discussed the results obtained in two baseline papers on knowledge distillation and interactive knowledge distillation. We proposed a novel method called bilateral knowledge distillation for breast histological images. In our approach, we took ResNeXt29 and MobileNetV2 as the baseline pretrained models. The phenomenon that knowledge distillation is essentially a label smoothing regularization technique was established yet again through the experiments conducted on a histological image dataset of the breast. Additionally, the idea that label smoothing regularization is an ad hoc KD is re-established. The interactive bilateral KD model which we proposed confirms that the teacher and student blocks interact with each other and are capable of transferring knowledge to each other through the correspondence being established, and this correspondence improves the accuracy in each subsequent phase. In normal KD, due to its higher validation accuracy, ResNeXt29, being the more complex model, was chosen as the teacher network and the simple MobileNetV2 as the corresponding student network, and the teacher and student networks were reversed for Re-KD. The limited accuracy obtained through these conventional methods prompted us to run the experiment with bilateral-interactive KD. The choice of transfer learning architecture was validated by baseline training. The result of bilateral KD is that the MobileNetV2 network produced excellent accuracy in phase 3 of the validation. Being a lightweight model, the MobileNetV2 architecture will also support deployment of the model in mobile applications. The challenges of working with large histological images and the associated computational complexity can thus be avoided. This approach produces state-of-the-art results with an accuracy of 96.3 percent. It will also boost the performance of the student model on complex and diverse cancer cell images. Moreover, the model does not rely on copying the teacher’s behaviour; rather, it can utilize the feature transformation ability of the teacher. Going forward, we believe that this paper will help researchers to work with histological data of other cancers as well.

Data Availability

Data will be made available on request.

Conflicts of Interest

The authors declare that there is no conflict of interestregarding the publication of this paper.

Figure 8: Results of training and validation using bilateral knowledge distillation (Train_accuracy and Test_loss over 60 epochs).

Table 2: Results of all experiments performed.

Experiment performed       Model                                   Validation accuracy
Baseline teacher training  ResNeXt29                               75%
Baseline teacher training  MobileNetV2                             60.71%
Normal KD                  ResNeXt29 teacher, MobileNetV2 student  60.71%
Reverse KD                 MobileNetV2 teacher, ResNeXt29 student  65.58%
Bilateral KD               Phase 1 + phase 2 + phase 3             96.3%


Acknowledgments

This research is self-funded.

References

[1] P. Ratta, A. Kaur, S. Sharma, M. Shabaz, and G. Dhiman, "Application of blockchain and Internet of Things in healthcare and medical sector: applications, challenges, and future perspectives," Journal of Food Quality, vol. 2021, Article ID 7608296, 2021.
[2] S. McGuire, "World Cancer Report 2014. Geneva, Switzerland: World Health Organization, International Agency for Research on Cancer, WHO Press, 2015," Advances in Nutrition, vol. 7, no. 2, pp. 418-419, 2016.
[3] R. A. Smith, V. Cokkinides, and H. J. Eyre, "American Cancer Society Guidelines for the Early Detection of Cancer, 2006," CA: A Cancer Journal for Clinicians, vol. 56, no. 1, pp. 11–25, 2006.
[4] P. Filipczuk, T. Fevens, A. Krzyzak, and R. Monczak, "Computer-aided breast cancer diagnosis based on the analysis of cytological images of fine needle biopsies," IEEE Transactions on Medical Imaging, vol. 32, no. 12, pp. 2169–2178, 2013.
[5] M. Kowal, P. Filipczuk, A. Obuchowicz, J. Korbicz, and R. Monczak, "Computer-aided diagnosis of breast cancer based on fine needle biopsy microscopic images," Computers in Biology and Medicine, vol. 43, no. 10, pp. 1563–1572, 2013.
[6] B. Wang, X. Yao, Y. Jiang, C. Sun, and M. Shabaz, "Design of a real-time monitoring system for smoke and dust in thermal power plants based on improved genetic algorithm," Journal of Healthcare Engineering, vol. 2021, Article ID 7212567, 2021.
[7] X. Huang, V. Jagota, E. Espinoza-Muñoz, and J. Flores-Albornoz, "Tourist hot spots prediction model based on optimized neural network algorithm," International Journal of System Assurance Engineering and Management, vol. 12, 2021.
[8] V. Jagota, A. P. S. Sethi, and K. Kumar, "Finite element method: an overview," Walailak Journal of Science & Technology, vol. 10, no. 1, pp. 1–8, 2013.
[9] L. Chen, V. Jagota, and A. Kumar, "Research on optimization of scientific research performance management based on BP neural network," International Journal of System Assurance Engineering and Management, vol. 12, 2021.
[10] D. C. Koboldt, R. S. Fulton, M. D. McLellan et al., "Comprehensive molecular portraits of human breast tumours," Nature, vol. 490, no. 7418, pp. 61–70, 2012.
[11] S. R. Lakhani, I. O. Ellis, S. J. Schnitt, P. H. Tan, M. J. van de Vijver, and World Health Organisation, Classification of Tumours of the Breast, WHO Press, 4th edition, 2012.
[12] M. Veta, J. P. W. Pluim, P. J. van Diest, and M. A. Viergever, "Breast cancer histopathology image analysis: a review," IEEE Transactions on Biomedical Engineering, vol. 61, no. 5, pp. 1400–1411, 2014.
[13] Z. Han, B. Wei, Y. Zheng, Y. Yin, K. Li, and S. Li, "Breast cancer multi-classification from histopathological images with structured deep learning model," Scientific Reports, vol. 7, no. 1, 2017.
[14] C. Sharma, B. Amandeep, R. Sobti, T. K. Lohani, and M. Shabaz, "A secured frame selection-based video watermarking technique to address quality loss of data: combining graph-based transform, singular valued decomposition, and hyperchaotic encryption," Security and Communication Networks, vol. 2021, Article ID 5536170, 2021.
[15] T. Kausar, M. A. Ashraf, A. Kausar, and I. Riaz, "Convolution neural network based approach for breast cancer classification," in 2021 International Bhurban Conference on Applied Sciences and Technologies (IBCAST), pp. 407–413, Islamabad, Pakistan, 2021.
[16] D. Praveena Anjelin and S. Ganesh Kumar, "Blockchain technology for data sharing in decentralized storage system," in Intelligent Computing and Applications (vol. 1172), Springer, Singapore, 2021.
[17] L. J. Ba and R. Caruana, "Do deep nets really need to be deep?," Advances in Neural Information Processing Systems, vol. 3, pp. 2654–2662, 2014.
[18] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, http://arxiv.org/abs/1503.02531.
[19] M. K. Sharma, T. Perumal, and N. Dey, "Innovations in computational intelligence and computer vision," in Advances in Intelligent Systems and Computing, p. 1189, Springer, 2021.
[20] C. Bucilǎ, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '06, pp. 535–541, New York, 2006.
[21] J. Dipalma, A. A. Suriawinata, L. J. Tafe, L. Torresani, and S. Hassanpour, "Resolution-based distillation for efficient histology image classification," 2021, http://arxiv.org/abs/2101.04170.
[22] X. Dai, Z. Jiang, Z. Wu et al., "General instance distillation for object detection," 2021, http://arxiv.org/abs/2103.02340.
[23] S. Ge, S. Zhao, C. Li, and J. Li, "Low-resolution face recognition in the wild via selective knowledge distillation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 2051–2062, 2019.
[24] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, "Knowledge adaptation for efficient semantic segmentation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 578–587, Long Beach, CA, USA, 2019.
[25] T. K. K. Ho and J. Gwak, "Utilizing knowledge distillation in deep learning for classification of chest X-ray abnormalities," IEEE Access, vol. 8, pp. 160749–160761, 2020.
[26] B. H. Tang and J. Wang, "A physics-based method to retrieve land surface temperature from MODIS daytime midinfrared data," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 4672–4679, 2016.
[27] A. Garg, K. Aggarwal, M. Saxena, and A. Bhat, "Classifying medical histology images using computationally efficient CNNs through distilling knowledge," in Emerging Technologies in Data Mining and Information Security, J. M. R. S. Tavares, S. Chakrabarti, A. Bhattacharya, and S. Ghatak, Eds., pp. 713–721, Springer, Singapore, 2021.
[28] J. J. Thiagarajan, S. Kashyap, and A. Karargyris, "Distill-to-label: weakly supervised instance labeling using knowledge distillation," in 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 902–907, Boca Raton, FL, USA, 2019.
[29] S. Fu, Z. Li, J. Xu, M.-M. Cheng, Z. Liu, and X. Yang, "Interactive knowledge distillation," 2020, http://arxiv.org/abs/2007.01476.
[30] Y. Li, "Performance evaluation of machine learning methods for breast cancer prediction," Applied and Computational Mathematics, vol. 7, no. 4, p. 212, 2018.
[31] L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng, "Revisiting knowledge distillation via label smoothing regularization," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3902–3910, Seattle, WA, USA, 2020.
[32] G. Aresta, T. Araújo, S. Kwok et al., "BACH: grand challenge on breast cancer histology images," Medical Image Analysis, vol. 56, pp. 122–139, 2019.
