Statistical Applications of Classification Algorithms Related to the Integration of Turkey
into the European Union
Author: Abdül Kuddüs Sünbül
16203822
The thesis is submitted to University College Dublin in part fulfillment of the requirements for
the degree of MSc Statistics
Supervisor: Dr. Riccardo Rastelli
August 2019
Table of Contents

List of Figures
List of Tables
Abstract
1. Introduction
1.1. Background and context
1.1.1. What is lead all about?
1.1.2. What are the lead isotope ratios and how is the data collected?
1.1.3. Reasons to be cautious
1.1.4. Relevant research questions and hypotheses
1.2. Motivation
1.2.1. Expected contributions to the area
1.2.2. Challenges
1.3. Research question
2. Exploratory Analysis and Comments
2.1. Data overview
2.2. Skewness
2.3. Histograms
2.4. The use of PCA and scatter plots
3. Theory and Methods
3.1. Understanding supervised and unsupervised learning
3.2. Statistical background and theoretical explanations of the methods
3.2.1. Decision Trees
3.2.2. Bagging
3.2.3. Random Forests
3.2.4. Boosting (AdaBoost)
3.2.5. Support vector machines (SVM)
3.2.6. K-nearest neighbors (KNN)
3.2.7. Logistic Regression
3.2.8. Neural Networks
3.2.9. K-means
3.2.10. The usage of Mclust package in R
3.3. Application of methods
3.3.1. K-fold cross validation and train-test split
3.3.2. Grid search
3.3.3. Application in Python and R
3.3.4. The use of mclust and why it was used against k-means
3.3.5. Application of neural networks
4.1. Performance measures
4.2. Error rates and their rankings for country pairs
4.3. Countries with their most erroneous pairs
4.4. Misclassification rates for the regions
4.5. Clusters
5.1. Why logistic regression fails?
6. Conclusion and Possible Future Work
References
Appendix
List of Figures

Figure 1: Samples of ingots
Figure 2: The histograms of columns
Figure 3: The histograms of columns for each country
Figure 4: Scatter plot with variables obtained using PCA
Figure 5: Pairwise scatter plot
Figure 6: The graphs show the results of K-means clustering and the Mclust package, from left to right
Figure 7: Accuracy rates comparison
Figure 8: F1 Score Comparison
List of Tables

Table 1: Sample Data
Table 2: Summary statistics of the data
Table 3: Country sample statistics
Table 4: Region sample statistics
Table 5: Confusion matrix of the KNN model
Table 6: A part of the pair error rates table
Table 7: A part of the error ranking table
Table 8: A sample of the data used to create the correlation coefficient
Table 9: Countries with their most erroneous pairs
Table 10: The misclassification rates table for the regions
Table 11: Clusters created by the mclust package
Abstract
This study focuses on the use of lead isotope ratios of sample artefacts in a machine learning context to support the integration of Turkey into the European Union. The importance of lead and its isotope ratios comes from their distinctive features. Especially in the last 20 years, lead isotope ratios have become prominent data for identifying the provenance of an ingot, since they are always contained in a greater or lesser quantity in ordinary metals. A shift in isotope composition from conversion processes such as weathering or smelting is not to be expected in these isotope systems, which makes the use of lead isotope ratios reliable in such studies. In addition, machine learning algorithms had not previously been applied to a lead isotope ratio dataset, which made this field a good choice for observing the success of machine learning methods. These applications and their results are useful for revealing the integration of Turkey and the European Union.

In this dissertation, a lead isotope ratio dataset consisting of 5 different isotope ratios together with country and region columns has been used in the implementation of the methods. The country column has been used as the target label for the classification algorithms. The classification algorithms used are decision trees, bagging, AdaBoost, random forests, logistic regression, k-nearest neighbors and support vector machines (SVM). A simple implementation of neural networks was a further classification approach. For the clustering part, the k-means algorithm and the mclust package in R have been used to obtain the groups of samples. The results of the classification methods have been compared with each other using several performance metrics, and misclassified samples have been used when interpreting the results.

The average success rate for the ensemble models was approximately 86%. While the SVM and decision tree algorithms perform well, the logistic regression model gives poor results. The results of the mclust application have been used in the interpretation of the misclassified samples.
1. Introduction
1.1. Background and context
The origin of an antiquity brought to light by archaeology is one of the hardest questions to resolve. The origin of metals, in particular in the form of ingots, has normally been investigated using the established methods of the study of antiquity. These methods, such as the study of the contents of a wreck and epigraphy, supply essential information. Over the past 20 years, the archaeological science known as archaeometry, which consists of the application of scientific techniques to the analysis of archaeological materials to assist in dating the materials and establishing their origin, has taken the lead in these studies. In order to analyze the origin of a wreck, the chemical composition of the ingots and the search for trace elements have gained importance. Lead isotope ratios have become prominent data for identifying the provenance of an ingot in these kinds of studies, as the ratios are always contained in a greater or lesser quantity in ordinary metals.1
1.1.1. What is lead all about?
Why lead is so important in archaeology has been explained by Dodson in a very simple and persuasive way: “Lead, offering as it does a
convenient combination of density and formability, is the first line of defense for radiation
shielding. However, newly smelted lead contains a radioactive lead isotope, Pb-210, which is
generated in the decay of U-238. While the uranium and other radioactive elements are largely
removed during the smelting process, the Pb-210 remains, producing a low-level radioactive decay
(about 200 decays per kilogram per second) that restricts the ability of the most sensitive nuclear
and particle physics experiments to function. Pb-210, however, has a 22.3 year half-life. When lead
bars have lain underwater for 2,000 years, all of the Pb-210 has decayed, leaving "Roman lead" (or
old lead) with a radioactive level roughly 100,000 times lower than is found in new lead.”2
1.1.2. What are the lead isotope ratios and how is the data collected?
The explanation of lead makes more sense with a fact stated by Bode, Hauptmann and Mezger. They say that “A shift in the isotope composition from conversion processes such as weathering or smelting is not to be expected in such isotope systems.”3 Lead isotope ratios also occur in different proportions in the mineral from which the lead is extracted; the naturally variable isotopic compositions that result make it possible to compare the isotopic compositions of the minerals with those of the manufactured products. This is the reason why ratios of isotopes are used: they give more sensitive indicators.1 The ratios generally used in the archaeometric context include 206Pb/207Pb and 208Pb/206Pb. In the light of these explanations, the lead isotope ratios of ingots can be used in the application of scientific techniques regardless of their age.
Figure 1: Samples of ingots
1.1.3. Reasons to be cautious
However, there are still some notable reasons to be cautious when deciding the lead provenance of an archaeological artefact. Trincherini and Barbero explain these reasons with an example in their
article. The article states that “two explanations are possible. One is global: commercial practices
(the shape of the ingots, and the way of marking them) could have become uniform from one region
(Hispania) to another (Gallia). The second is more particular: operators of Spanish mines might
have moved to the south of Gaul to operate mines there, taking their working habits with them. The
last argument to be debated is that of the loading of the boat itself--Gallic ingots below, and
Baetican amphorae over them. On this point, one may imagine that there was a trade of
redistribution from a port of the Languedoc coast--Narbonne, Agde or Lattes--in which the ingots
that were produced locally would normally have been loaded first, followed by the Baetican oil
amphorae, which had been brought there previously by another ship.” The reasons explained in this example make sense in general, since similar scenarios happened every day and the same can be the case in different situations. Therefore, it is expected that distinguishing some artefacts from each other using their lead isotope ratio values will not be easy despite their different origins.
1.1.4. Relevant research questions and hypotheses
When the use of lead isotope ratios in archaeology is searched in the literature, it can be seen that there are articles using lead isotope ratios as significant explanatory variables. Some of these articles aim to reveal how well lead isotope ratios can distinguish the provenance of archaeological artefacts. An article titled “Provenance evidence for Roman lead artefacts of distinct chronology from Portuguese archaeological sites”, published by Gomes, Araújo, Soares and Correia in 2017, is one example of using lead isotope ratios to distinguish the provenance of archaeological artefacts.

That study focuses on the comparison of variations in lead isotope ratios in materials found at Alto dos Cacos and Conimbriga. The intention is to investigate whether the source of leaden raw materials changed over time and, if so, to try to identify those different sources.4 Some other papers focus on clarifying the provenance of lead artefacts from military fortresses and camps located in certain regions. As is well known, lead was one of the significant metals used for military purposes throughout history. These studies use lead isotope ratio data for further statistical analysis, such as creating confidence intervals for regions or plotting the ratios on a coordinate plane to demonstrate differences.
1.2. Motivation
1.2.1. Expected contributions to the area
During a detailed literature review, although it proved possible to predict the origins of archaeological samples using classification algorithms, no previous use of any machine learning technique on lead isotope ratio data was found. That is why using sensitive lead isotope ratio indicators with different machine learning approaches was chosen as the main focus of this study. Here, the origins of samples have been predicted using lead isotope ratios as independent variables in classification algorithms. Some misclassified samples have been identified through the implementation of these methods. Moreover, a clustering approach has been used to reveal the groups among the artefacts. The results obtained by both the classification and the clustering algorithms have been examined by considering possible relations between them. When interpreting the results, the drawbacks stated by Trincherini and Barbero have been taken into consideration. The overall goal was to see the advantages and disadvantages of certain machine learning implementations on a dataset which had not been used for this purpose before.
1.2.2. Challenges
A challenge of this project is that the dataset contains only lead isotope values. Any prior information used by archaeologists to determine origins, such as the epigraphy of the artefacts, could not be used in this study. The success of this study also depends on the environment and technologies which were used for measuring the lead isotope values.
1.3. Research question
The perspective in this study is that if the provenances of ingots can be determined using lead isotope ratios in archaeology, it should also be possible to identify the origins of these samples using efficient machine learning methods. Therefore, the question answered here is whether some machine learning applications are useful and successful in determining the origins of sample artefacts from their isotope ratios. If the success rates of these algorithms are better than random guessing, the next step is to increase the success and accuracy rates of the implementations by using optimized hyper-parameters for the models. The results of the algorithms have been compared with each other using different performance measures and error rates. Also, by means of a clustering approach, possible connections between misclassified samples and regions have been revealed, and the results have been visualized on a map to make the findings easily understandable. Using these misclassified samples and clustering results, the possible reasons remarked on in other articles have been assessed.
2. Exploratory Analysis and Comments
2.1. Data overview
The data used in the study contains the isotopic ratios of lead in forms such as 208Pb/206Pb and 207Pb/204Pb. There is country and region information for each sample. 4363 different archaeological artefacts with 5 different lead isotope ratios and a country column as target label have been used to apply the different machine learning methods. The region column has been used when interpreting the results with respect to the misclassified samples of the classification algorithms. A sample of the dataset can be seen in Table 1.
Table 1: Sample Data

  Country  Region           208Pb/206Pb  206Pb/207Pb  206Pb/204Pb  207Pb/204Pb  208Pb/204Pb
0 Egypt    Timna            2.07167      1.065417     18.800       17.645680    38.947396
1 Eire     NaN              2.10380      1.094451     18.178       16.609239    38.242876
2 Eire     Galway           2.14261      1.117743     17.253       15.435569    36.966450
3 Egypt    Gabal El Ineigi  2.11090      1.139134     17.630       15.476672    37.215167
4 Italy    Sardinia         2.12330      1.141279     17.812       15.607053    37.820220
These samples come from 17 countries and 460 regions. As seen from the table above, there are some NaNs in the dataset. The first step was to examine the number of NaNs and decide how to handle them. 3 of the 5 lead isotope ratio columns have 20 missing values, and these missing values fall in the same rows. The 20 rows with missing values correspond to approximately 0.5% of the whole dataset. If the rows with missing values are dropped, no important information that can be obtained from the dataset will be lost. Therefore, they were simply excluded from the dataset.
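The missing-value check and exclusion described above can be sketched with pandas. A tiny illustrative frame with the same missing-value pattern is used here, since the full dataset is not reproduced:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one row has NaNs in the isotope ratio columns.
df = pd.DataFrame({
    "Country": ["Egypt", "Eire", "Eire"],
    "208Pb/206Pb": [2.07167, np.nan, 2.14261],
    "206Pb/207Pb": [1.065417, np.nan, 1.117743],
})

print(df.isna().sum())  # count of NaNs per column

# Exclude rows with missing isotope values, as done in the study.
clean = df.dropna(subset=["208Pb/206Pb", "206Pb/207Pb"])
print(len(clean))       # rows remaining after exclusion
```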
The new dataset has 4343 samples from 16 countries and 453 regions. Portugal had only 1 sample, and it contained missing values, so the country was excluded from the dataset completely. Table 2 shows the summary statistics of the data.
Table 2: Summary statistics of the data

       208Pb/206Pb  206Pb/207Pb  206Pb/204Pb  207Pb/204Pb  208Pb/204Pb
count  4343         4343         4343         4343         4343
mean   2.081477     1.185914     18.569244    15.658071    38.644838
std    0.022437     0.019471     0.332109     0.071317     0.357808
min    1.635600     1.065417     17.253000    14.995782    36.966450
25%    2.069010     1.168681     18.290000    15.628000    38.367965
50%    2.078440     1.191909     18.622000    15.658252    38.666448
75%    2.098257     1.198804     18.807000    15.689925    38.943319
max    2.142610     1.465846     23.690000    17.645680    41.061193
204Pb values are relatively smaller than the other lead isotope values, so the ratios with 204Pb in the denominator have larger means. Mean and median values are close to each other for all columns, which suggests that the data does not contain many outliers. Also, especially for the 206Pb/207Pb column, the minimum and maximum values are not close to the adjacent quartile values; this issue has been assessed by means of other plots created in the next steps.
The next step was to examine the number of samples for each country and region in order to determine how, and which, machine learning algorithms should be constructed for the data. As seen from Table 3, 8 countries, namely Greece, Italy, Spain, Cyprus, Turkey, England, Wales and Bulgaria, form 97.4% of all samples. Only these 8 countries have been used in the analyses. The reason for choosing these countries is that there is a breaking point in the counts after the 8th country, Bulgaria, from 3.1% to 1.1%. Thus, every country will surely be represented in each cross-validated training set and in the test set.
Table 3: Country sample statistics

           Count  Percentage  Cumulative P.
Greece     1338   30.8%       30.8%
Italy      770    17.7%       48.5%
Spain      728    16.8%       65.3%
Cyprus     490    11.3%       76.6%
Turkey     361    8.3%        84.9%
England    261    6.0%        90.9%
Wales      146    3.4%        94.3%
Bulgaria   136    3.1%        97.4%
Eire       47     1.1%        98.5%
Egypt      31     0.7%        99.2%
Scotland   23     0.5%        99.7%
Syria      6      0.1%        99.9%
France     3      0.1%        99.9%
Algeria    1      0.0%        100.0%
Palestine  1      0.0%        100.0%
As seen from Table 4, only 23 regions, approximately 5% of all regions, account for more than 50% of all samples. There are many regions with only a few samples, so applying classification algorithms with the region column as target label would not be appropriate. That is why the country column, restricted to the 8 countries above, was selected as the target label in this study.
Table 4: Region sample statistics

                       Count  Percentage  Cumulative P.
Sardinia               468    10.9%       10.9%
Cyclades               304    7.1%        17.9%
Sierra Morena          185    4.3%        22.2%
Attica                 142    3.3%        25.5%
Sud-Est                117    2.7%        28.3%
Province of Huelva     98     2.3%        30.5%
Larnaca                97     2.3%        32.8%
Province of Almeria    96     2.2%        35.0%
Province of Sevilla    80     1.9%        36.9%
Solea                  63     1.5%        38.3%
Burgas district        48     1.1%        39.5%
Sardinia, Iglesiente   48     1.1%        40.6%
Limni                  47     1.1%        41.7%
Crete                  45     1.0%        42.7%
Cornwall               44     1.0%        43.7%
Gwynedd                43     1.0%        44.7%
Peloponnese            40     0.9%        45.7%
Dyfed (Cardiganshire)  39     0.9%        46.6%
Cumbria                37     0.9%        47.4%
Seriphos, Moutoulos    37     0.9%        48.3%
Cyclades, Siphnos      36     0.8%        49.1%
Rhodope                34     0.8%        49.9%
Othrys Mountains       31     0.7%        50.6%
2.2. Skewness
Before further analysis, the question of whether the data has to be log-transformed needs to be answered, since heavily skewed data is not suitable for use in the classification methods. In order to check the skewness, histograms of all columns of the original and log-transformed data have been created. The results can be seen in Figure 2. As seen, there are only a few outliers, and there is no need to treat them specially.
Figure 2: The histograms of columns
2.3. Histograms
As seen, the histograms on both sides have similar shapes. This means there is no need to log-transform the data, since the data is not skewed; therefore, the original dataset will be used for further analyses. The important point that needs to be handled is that the histograms have multiple peaks. The reason for more than one peak could be differences between the lead isotope ratios of different countries, or differentiation between the values of regions within some countries. In order to investigate this, histograms of every column for each country have been created; they can be seen in Figure 3.
Figure 3: The histograms of columns for each country
It is observed that there are some countries, such as Greece and England, whose samples are similar and generally produce one peak. On the other hand, there are some countries, especially Spain, with more than one peak. When the Spain samples are clustered into two groups using the k-means algorithm, the results show that the samples separate into east and west. Huelva, Seville and Cadiz are examples of the west cluster, while Almeria and Murcia are examples of the east cluster. This means that the map boundaries of today's world may not overlap with historical facts. This situation has been taken into consideration while examining the results of the methods.
2.4. The use of PCA and scatter plots
Another approach for exploratory analysis was to display the samples of each country on a scatter plot. However, scatter plots need only two columns explaining the data. That is why Principal Component Analysis (PCA) has been used to decrease the number of columns from 5 to 2. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) form an uncorrelated orthogonal basis set.5
For this purpose, I have used the PCA function of the scikit-learn library in Python, which reduces dimensionality using a singular value decomposition of the data to project it to a lower-dimensional space.6 The two components obtained as a result explain 99.2% of the variance in the data. Therefore, a scatter plot of these values is reliable for forming an opinion about how well the countries are distinguished from each other. Figure 4 shows that the countries are not distinctly separated from each other, so it is expected that the dataset contains groups with samples from different countries.
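The reduction described above can be sketched as follows. The data here is a synthetic stand-in for the five isotope ratio columns, so the variance figures will differ from the 99.2% reported for the real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the five isotope ratio columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make two columns correlated

# Project the five columns down to the two leading principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

print(scores.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```

The two columns of `scores` are the coordinates used for the scatter plot, and `explained_variance_ratio_` is how the 99.2% figure for the real data is obtained.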
Figure 4: Scatter plot with variables obtained using PCA
When the dataset is examined with the pairwise scatter plots in Figure 5, a similar result to the PCA plot is observed.
Figure 5: Pairwise scatter plot
3. Theory and Methods
The purpose of this study is to observe the success and usability of some statistical machine learning algorithms on the lead isotope ratio data of archaeological artefacts. These artefacts come from 8 different countries and many more regions. In statistical modeling, the values of dependent variables depend on the values of independent variables. The dependent variables represent the output or outcome whose variation is being studied, while the independent variables represent the inputs, that is, the potential causes of that variation.7 In this case, the independent variables are the lead isotope ratios and the dependent variable is the country information of the samples.
3.1. Understanding supervised and unsupervised learning
There are several common learning scenarios in machine learning. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm. Two of these scenarios are applicable to this study: supervised and unsupervised learning. In supervised learning, the learner receives a set of labeled examples (here, lead isotope ratios with the country column as label) as training data and makes predictions for all unseen points. This is the most common scenario associated with classification problems, and it is the one used in this study. In unsupervised learning, the learner exclusively receives unlabeled training data and makes predictions for all unseen points.8 In this study, although the data have labels, the labels are ignored when applying the clustering methods. The purpose is to obtain groups regardless of the countries or regions in the first place; the relation of the error rates for each country pair to these clusters is then interpreted.
3.2. Statistical background and theoretical explanations of the methods
In the area of classification there are several algorithm families: linear classifiers, with logistic regression and the support vector machine as examples; boosting algorithms; decision trees and random forests, which use decision trees as the base algorithm; and neural networks. Among many classification algorithms, 7 of them plus neural networks have been used in this study. These 7 algorithms are decision trees, random forests, bagging, boosting (AdaBoost), support vector machines, k-nearest neighbors and logistic regression. The reason for choosing these algorithms is to apply a wide range of classification algorithms compatible with the data. Also, two clustering methods have been used: k-means clustering and finite Gaussian mixture modelling via the mclust package in R. In the next part, these models are explained briefly together with their statistical background.
3.2.1. Decision Trees
Decision tree learning begins with the question "which attribute should be tested at the root of the tree?" To answer this question, each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples. This statistical test, in other words the information gain, measures how well a given attribute separates the training examples according to their target classification. Entropy is one way to characterize the (im)purity of an arbitrary collection of examples for information gain. For a set of examples S with class labels, where pi is the relative frequency (probability) of class i, it is formulated as follows:

Entropy(S) = − Σi pi log2(pi)
The best attribute is selected and used as the test at the root node of the tree. A descendant of the
root node is then created for each possible value of this attribute, and the training examples are
sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's
value for this attribute). The entire process is then repeated using the training examples associated
with each descendant node to select the best attribute to test at that point in the tree.9
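The entropy formula above can be written directly in Python as a quick check (a small helper, not part of the study's own code):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels: -sum_i p_i * log2(p_i)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    # "+ 0.0" avoids returning -0.0 for a pure (single-class) set.
    return float(-(p * np.log2(p)).sum()) + 0.0

print(entropy(["a", "a", "b", "b"]))  # 1.0: maximally impure two-class set
print(entropy(["a", "a", "a", "a"]))  # 0.0: a pure set
```

A pure node has entropy 0 and an evenly split two-class node has entropy 1, which is exactly the (im)purity scale used when choosing the root attribute.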
3.2.2. Bagging
Bagging is short for bootstrap aggregating and is an ensemble learning method. It is a general idea that can be used with any predictive method and is very often used for classification.

Bagging works as follows:
1. Take a bootstrap sample of your data. Recall that approximately 0.632N (where N is the total number of samples) unique observations will be in the bootstrap sample.
2. Fit a classifier to the bootstrap sample. (In this study, the classifier is a decision tree.)
3. Use the classifier on the non-bootstrap samples to assess the accuracy. This is called the Out Of Bag prediction.
4. Repeat M times.

To classify a new sample, all of the classification trees are used and the majority class produces the final decision.10 So, bagging can be seen as an upgraded version of decision trees.
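Steps 1–4 can be sketched with scikit-learn's BaggingClassifier on synthetic data (the default base learner is a decision tree, as in this study):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data with a simple class boundary.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# n_estimators is M in the steps above; oob_score=True assesses accuracy
# on the non-bootstrap (out-of-bag) samples, as in step 3.
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=2)
bag.fit(X, y)
print(bag.oob_score_)  # out-of-bag accuracy estimate
```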
3.2.3. Random Forests
Random forests are also an ensemble learning method for classification trees (decision trees): they operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees.11 Random forests correct for decision trees' habit of overfitting to their training set.12 Random forests create different datasets by sampling n training examples from the original dataset with replacement and train a decision tree on each newly created dataset. Up to this point random forests work in the same way as bagging. However, at each split of each tree, random forests additionally consider only a random subset of the independent variables. Again, after training, predictions for unseen samples are made by taking the majority vote, so random forests can be seen as an upgraded version of bagging.

The use of a different random subset of the variables at each split of the classification tree makes the trees even more diverse than in the bagging algorithm.
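The random subset of variables at each split, which is what distinguishes the forest from plain bagging, corresponds to the max_features argument in scikit-learn. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# max_features=2: only a random subset of 2 of the 5 variables is
# considered at each split, making the trees more diverse than in bagging.
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=3)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy
```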
3.2.4. Boosting (Adaboost)
The idea in boosting is that the outputs of other learning algorithms, the weak learners (decision trees in this study), are combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. That is why AdaBoost can be sensitive to noisy data and outliers. As with all ensemble methods, the individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm, so that later trees tend to focus on harder-to-classify examples.13 In this respect, AdaBoost is completely different from the other ensemble methods used in this study, bagging and random forests. AdaBoost works algorithmically as follows14:
1. Initialize the observation weights: wi = 1/N, i = 1, 2, ..., N
2. For m = 1 to M repeat steps (a)–(d):
   a. Fit a classifier Gm(x) to the training data using weights wi
   b. Compute errm = Σi wi I(yi ≠ Gm(xi)) / Σi wi
   c. Compute αm = log((1 − errm)/errm)
   d. Update the weights for i = 1, ..., N: wi ← wi · exp(αm · I(yi ≠ Gm(xi))), and renormalize the wi to sum to 1
3. Output G(x) = sign(Σm αm Gm(x))
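A minimal sketch of this procedure via scikit-learn's AdaBoostClassifier, on synthetic data (the default weak learner is a depth-1 decision tree, playing the role of Gm(x) above):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (X[:, 1] > 0).astype(int)

# n_estimators is M in the algorithm above; each round reweights the
# training samples in favor of those the previous stumps got wrong.
boost = AdaBoostClassifier(n_estimators=50, random_state=4)
boost.fit(X, y)
print(boost.score(X, y))  # training accuracy
```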
3.2.5. Support vector machines (SVM)
Support vector machines are a mathematically more complex method than the other methods used in this study; they are therefore explained here in the simplest possible terms, without going into the full model. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. The classes are separated by hyperplanes which represent the largest separation, or margin, between the two or more classes.15

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.11 In this study both types of SVM have been applied via the GridSearch process of the scikit-learn library in Python.
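A sketch of such a grid search over linear and kernelized SVMs, on synthetic data with a deliberately non-linear class boundary (the parameter grid here is illustrative, not the one used in the study):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data with a circular (non-linear) class boundary.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)

# Try both a linear and a kernelized (rbf) SVM over a small grid of C values.
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # the best kernel/C combination found
```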
3.2.6. K-nearest neighbors (KNN)
The k-nearest neighbors algorithm is one of the easiest algorithms to understand, and the idea is very simple. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer).

A useful technique to increase the success of k-NN can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.16
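The plurality vote and the 1/d weighting scheme map directly onto scikit-learn (synthetic data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# weights="distance" applies the 1/d weighting described above,
# so nearer neighbors count for more in the vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X, y)
print(knn.score(X, y))  # 1.0 on the training set: each point is its own zero-distance neighbor
```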
3.2.7. Logistic Regression
In the multiclass case, the training algorithm in the scikit-learn library of Python needs to use the one-vs-rest (OvR) scheme.17 In one-vs-rest, C separate binary classification models are trained. Each classifier fc, for c ∈ {1, ..., C}, is trained to determine whether or not an example is part of class c. To predict the class for a new example x, all C classifiers are run on x and the class with the highest score is chosen: y = arg maxc ∈ {1,...,C} fc(x).

One main drawback is that when there are many classes, each binary classifier sees a highly imbalanced dataset, which may degrade performance.18
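The OvR scheme can be made explicit with scikit-learn's OneVsRestClassifier wrapper (synthetic three-class data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data: the label is whichever of the first
# three features is largest.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = np.argmax(X[:, :3], axis=1)

# One binary logistic model per class; prediction takes the
# highest-scoring class, as described above.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_))  # 3: one fitted binary classifier per class
```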
3.2.8. Neural Networks
Neural networks are actually too large a topic to explain in a paragraph. They are not just another machine learning method like the others used in this study; their applications are known as deep learning in the literature. Therefore, they are only defined here together with the inspiration behind them. Artificial Neural Networks (ANNs), the type of neural network used in this study, are inspired by biological nervous systems, such as the brain, where large numbers of interconnected neurons work together to solve a problem. A neuron receives signals from other neurons through connections called synapses. The combination of these signals, in excess of a certain activation level, will result in the neuron “firing”, i.e. sending a signal on to other neurons connected to it. As a result of long chains of computational stages, multi-class predictions are obtained with the help of an activation function at the end.
3.2.9. K-means
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. To process the training data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster. Iterative calculations are then performed to optimize the positions of the centroids. The algorithm stops creating and optimizing clusters when either:
- The centroids have stabilized: there is no change in their values because the clustering has been successful.
- The defined number of iterations has been reached.19
Moreover, the clusters are assumed to be spherical and evenly sized, which can make k-means unreliable as a clustering method in some studies.
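The loop and the two stopping conditions above can be sketched in a few lines of plain Python. The one-dimensional toy points and the fixed starting centroids below are illustrative only; real use would call e.g. scikit-learn's KMeans, which also randomizes the initial centroids.

```python
# A stdlib-only sketch of the k-means loop described above (1-D points for brevity).
def kmeans(points, k, centroids, max_iter=100):
    for _ in range(max_iter):          # stop condition 2: iteration limit
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:           # stop condition 1: centroids stabilised
            break
        centroids = new
    return centroids, clusters

cents, groups = kmeans([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], k=2, centroids=[0.0, 6.0])
print(cents)
```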
3.2.10. The usage of Mclust package in R
Mclust is a contributed R package for model-based clustering, classification, and density estimation
based on finite normal mixture modelling. It provides functions for parameter estimation via the
EM algorithm for normal mixture models with a variety of covariance structures, and functions for
simulation from these models. Also included are functions that combine model-based hierarchical
clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in
comprehensive strategies for clustering, density estimation and discriminant analysis. Additional
functionalities are available for displaying and visualizing fitted models along with clustering,
classification, and density estimation results.20
This package has been used to obtain more reliable clusters for the data without running into the same issues as the k-means algorithm.
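The thesis's clustering itself was done in R with mclust. Purely as an illustration, a rough Python analogue of mclust's core idea, Gaussian mixtures fitted by EM with the number of components chosen by BIC, can be sketched with scikit-learn's GaussianMixture; the synthetic blobs below are stand-ins, not the isotope data.

```python
# A rough Python analogue of mclust's core idea (EM-fitted Gaussian mixtures,
# model chosen by BIC); the thesis used the mclust R package, not this code.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for the isotope-ratio data.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

# Fit mixtures with 1..5 components and keep the one with the lowest BIC,
# mirroring mclust's BIC-based choice of the number of clusters.
fits = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)]
best = min(fits, key=lambda m: m.bic(X))
print(best.n_components)
```

Unlike k-means, each component carries its own covariance matrix, so the fitted clusters need not be spherical or evenly sized.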
3.3. Application of methods
The methods used in this study have mainly been applied in Python; the only application in R was the use of the mclust package for clustering the samples. In supervised learning, evaluating the performance of the methods is also an important part of the study. For the unsupervised learning methods, no predictions are made, so evaluating the models with performance measures does not apply.
Before applying the machine learning methods, feature scaling was applied to the dataset. Feature scaling is a method used to normalize the range of the independent variables or features of the data; it is also known in the literature as data normalization and is generally performed during the data preprocessing step. The objective functions of some machine learning algorithms do not work properly without normalization, since many classifiers compute the distance between two points as the Euclidean distance. By normalizing the values, each feature contributes approximately proportionately to the final distance. Another benefit of feature scaling is that gradient descent converges much faster with it than without it.
3.3.1. K-fold cross validation and train-test split
For classification problems, it is natural to measure a classifier’s performance in terms of the error
rate. The classifier predicts the class of each instance: if it is correct, that is counted as a success;
if not, it is an error. The error rate is just the proportion of errors made over a whole set of instances,
and it measures the overall performance of the classifier.21 For evaluating performance, the first step is to set up a train-test split for the dataset. The train and test sets have been constructed using a 70-30% split in this study; thus, the training set has 2961 samples and the test set has 1269.
Another technique used while implementing the classification algorithms is k-fold cross validation.
The goal of k-fold cross validation is to test the model's ability to predict new data that was not
used in estimating it, in order to flag problems like overfitting or selection bias22 and to give insight into how the model will generalize to an independent dataset. The general procedure for k-fold cross validation is as follows:
1. Split the dataset into k groups
2. For each unique group:
o Take the group as a hold out or test data set
o Take the remaining groups as a training data set
o Fit a model on the training set and evaluate it on the test set
Multiple rounds of cross-validation are performed and the validation results are averaged over the
rounds to give an estimate of the model's predictive performance.
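The procedure above maps directly onto scikit-learn's KFold. The toy one-feature data below is illustrative only, not the thesis dataset.

```python
# A sketch of the k-fold procedure listed above, using scikit-learn's KFold.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[i] for i in range(20)], dtype=float)
y = np.array([0] * 10 + [1] * 10)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Each group is held out once as the test set; the rest trains the model.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(sum(scores) / len(scores))   # averaged estimate of predictive performance
```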
3.3.2. Grid search
Grid search is an approach also known as hyperparameter optimization. A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from the data; such values have to be set before the learning process begins. Examples include k in k-nearest neighbors and the number of hidden layers in neural networks.
GridSearchCV function of scikit-learn library is useful for optimizing the parameters of the
estimator together with cross-validation.
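As a sketch of the GridSearchCV usage, the example below tunes k for a k-NN classifier on toy data; the thesis tuned each model's own grid in the same way, but the data and grid here are illustrative assumptions.

```python
# A sketch of hyperparameter search with GridSearchCV, here tuning k for k-NN.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[i] for i in range(30)], dtype=float)
y = np.array([0] * 15 + [1] * 15)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7]},
                      cv=5)               # every setting is scored by cross-validation
search.fit(X, y)
print(search.best_params_)
```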
3.3.3. Application in Python and R
All classification methods except neural networks have been applied using scikit-learn, a machine learning library for Python. Neural networks have been applied using the Keras library, which runs on top of TensorFlow. After applying the classification methods, cross tabulations and confusion
matrices have been created using the best hyperparameters for each model. Using these matrices,
the results can easily be interpreted for each class.
3.3.4. The use of mclust and why it was used against k-means
To create clusters with the mclust package, the MclustDR function was used. MclustDR aims at reducing the dimensionality by identifying a set of linear combinations of the original features, ordered by importance as quantified by the associated eigenvalues, which capture most of the clustering or classification structure contained in the data.
As mentioned earlier, k-means clustering has some disadvantages: the clusters it creates are assumed to be spherical and evenly sized. To visualize what is meant by spherical and evenly sized, I have clustered the original data into 7 groups using k-means. As seen on the left of figure 6, the k-means algorithm groups samples that are close to each other, whereas the mclust package has created the clusters in a different manner.
Figure 6: The graphs show the results of K-means clustering and Mclust package from left to right
3.3.5. Application of neural networks
Neural networks have been used only briefly in this study. Using the GridSearchCV function, different hyperparameters were tried at once, such as the adam and stochastic gradient descent optimizers. Softmax was used as the output activation function, since the aim of the study is to predict multiple classes in the dataset. The model structure had one hidden layer with 200 nodes. The epoch number, which defines the number of times the learning algorithm works through the entire training dataset, was set to 1000. The batch size, a hyperparameter defining the number of samples to work through before updating the internal model parameters, was set to 32. The reason other values for the epochs, hidden units and batch size were not tried is that the run time of this model was about half an hour: it starts at an approximately 45% success rate and ends at 72-73%. This means that the time required to run even simple neural network models is much longer than for the other machine learning methods.
I think the success rate of neural networks on this dataset could be increased by using larger epoch numbers and different hyperparameters. Since this is the first article applying machine learning methods to lead isotope ratios, classical methods such as KNN and random forests were preferable in the first place; therefore, it was decided to frame this study around conventional machine learning methods.
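The thesis trained this architecture in Keras; purely as an illustration of the shape described above (one hidden layer, softmax output over 8 countries), the forward pass can be sketched in numpy. The random weights below are stand-ins, not a trained model.

```python
# A numpy sketch of the forward pass of the network shape described above;
# the thesis used Keras, and the weights here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 5, 200, 8   # 5 isotope ratios, 8 countries

W1, b1 = rng.normal(0, 0.1, (n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, n_classes)), np.zeros(n_classes)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)          # hidden layer with ReLU activation
    z = h @ W2 + b2
    e = np.exp(z - z.max())                 # softmax over the 8 class scores
    return e / e.sum()

p = forward(rng.normal(size=n_features))
print(p.argmax(), p.sum())                  # predicted class and total probability
```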
4. Results
4.1. Performance measures
Choosing an appropriate performance measure is one of the most essential parts of a machine learning study. The first metric that comes to mind is classification accuracy, but the accuracy rate alone is not enough to truly judge the results. Precision and recall are other important ways to evaluate the results, and the F1 score, a combination of precision and recall, is also beneficial; it has been used together with the accuracy rate to evaluate and compare the models in this study.
Accuracy is the number of correctly predicted data points out of all the data points. More formally,
it is defined as the number of true positives and true negatives divided by the number of true
positives, true negatives, false positives, and false negatives. A true positive or true negative is a
data point that the algorithm correctly classified as true or false, respectively. A false positive or
false negative, on the other hand, is a data point that the algorithm incorrectly classified. The
comparison of the classification algorithm results used in this study in terms of accuracy rate can
be seen in figure 7.
Figure 7: Accuracy rates comparison
The F1 score is the harmonic mean of precision and recall. Its range is [0, 1], where 1 is perfect classification and 0 is total failure. It tells you how precise your classifier is, as well as how robust it is. To understand the F1 score, precision and recall need to be known. Precision is the number of correct positive results divided by the number of positive results predicted by the classifier; it answers the question "What proportion of positive identifications was actually correct?" Recall is the number of correct positive results divided by the number of all relevant samples; it answers the question "What proportion of actual positives was identified correctly?" The F1 score is the better measure to use in this study, since there is an uneven class distribution and a balance between precision and recall is needed:23 the dataset has 30.8% of its samples from Greece and only 3.1% from Bulgaria. The formulas for precision, recall and the F1 score are as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
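The three formulas can be computed directly from the true/false positive and negative counts; the counts below are illustrative, not taken from the thesis results.

```python
# A tiny sketch computing the three measures from TP/FP/FN counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```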
The comparison of the classification algorithms’ results used in this study in terms of F1 score can
be seen in figure 8.
Figure 8: F1 Score Comparison
Beyond the performance measures, four further major results have been produced using the dataset and the models. The first uses the confusion matrices of the classification algorithms to create an error-ranking table for country pairs. The second focuses on which countries cause the most error on average and determines the 3 most erroneous pairs for each country; it also examines the results with respect to the distances between the countries' capitals. The third concentrates on the regions by examining the misclassified samples of the models. The last result relates to the mclust package and its clusters.
4.2. Error rates and their rankings for country pairs
The main approach in this part was to compute error rates from the models' confusion matrices. The error rate for each country has been calculated row by row. To make the strategy more understandable, a sample confusion matrix is shown in table 5 and used for a detailed explanation.
Table 5: Confusion matrix of KNN model
The values of the matrix in table 5 have been divided by the 'All' column to get the error rate for each pair; the diagonal values have not been used, since they are not part of the error rates. As an example, 4 of the 51 Bulgaria samples have been predicted as Cyprus by the KNN model, so the error rate for the Bulgaria-Cyprus pair is 0.078. Likewise, error rates have been computed from all confusion matrices except that of the logistic regression model in order to create the final error rates table. The logistic regression (LR) model has been excluded because its predictions contain only 6 countries: almost 50% of all samples have been predicted as Greece and 30% as Italy. Moreover, the F1 score and accuracy rate of the LR model are clearly worse than the other models' rates. From the confusion matrices of the 6 models, the final error rates table has been created; a sample of it can be seen in table 6.
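The row-normalisation step described above can be sketched with pandas. The tiny matrix below is made up, apart from the Bulgaria row, which reproduces the 4-of-51 example from the text.

```python
# A sketch of how the pair error rates were read off a confusion matrix:
# each off-diagonal count divided by the row total (the 'All' column).
import pandas as pd

cm = pd.DataFrame(
    {"Bulgaria": [45, 3], "Cyprus": [4, 90], "Greece": [2, 7]},
    index=["Bulgaria", "Cyprus"],     # true labels on rows, predictions on columns
)
rates = cm.div(cm.sum(axis=1), axis=0)          # divide each row by its total
pairs = {(t, p): rates.loc[t, p]                # keep only off-diagonal pairs
         for t in rates.index for p in rates.columns if t != p}
print(round(pairs[("Bulgaria", "Cyprus")], 3))
```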
Table 6: A part of pair error rates table
In the next step, these error rates have been ranked for each model, and from these rankings a final average error ranking has been generated. In the table below, a smaller ranking for a pair means a higher error rate for that pair. A sample of the ranking table can be seen in table 7.
Table 7: A part of error ranking table
4.3. Countries with their most erroneous pairs
In this part, the first step was to obtain the coordinates of the countries' capitals and compute the distances between them. The capital coordinates were found on the web and loaded into Python via an Excel file. With 8 countries, the error ranking table contains 56 different pairs. The simplest yet meaningful result that could be produced from the capital distances and the pair error rankings was a correlation coefficient: the correlation between these two columns is 0.43. A sample of the data used to calculate the correlation coefficient can be seen in table 8.
Table 8: A sample of the data used to compute the correlation coefficient
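The Pearson correlation between capital distances and error rankings can be sketched with the standard library; the distance and ranking values below are hypothetical placeholders, not the values behind the 0.43 figure.

```python
# A stdlib sketch of the Pearson correlation computed between capital
# distances and pair error rankings; the values below are made up.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

distances = [300, 800, 1500, 2200]      # hypothetical capital distances (km)
rankings = [5, 12, 30, 41]              # hypothetical error rankings
print(round(pearson(distances, rankings), 2))
```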
The result indicates that errors in the predictions are more likely when the countries are geographically closer to each other. The 15 most erroneous pairs in the error rankings table have also been examined to find out which countries' samples were most often predicted wrongly: Bulgaria and Wales each appear in the list 4 times, and England 3 times.
The final table shows, for each country, the first three countries causing errors in its predictions; it can be seen in table 9. For example, the samples of England have most often been wrongly predicted as Italy, Greece and Bulgaria, respectively.
Table 9: Countries with their most erroneous pairs
4.4. Misclassification rates for the regions
The same technique used to get the error rankings for country pairs has also been applied to get the misclassification rates for the regions. First, the number of misclassified samples for every classification method except LR has been calculated region by region. These numbers have been divided by the total number of samples in each region, giving the proportion of misclassified samples per region for all the methods. Next, the average of these rates has been calculated and combined with both the countries of the regions and the number of samples per region to produce an interpretable table. A sample of the generated table can be seen in table 10.
Table 10: The misclassification rates table for the regions
The mean column in table 10 shows the misclassification rates for the regions. The regions have also been filtered according to their total number of samples, because a region with fewer samples than a certain threshold is very likely to be misclassified. This approach is similar to the one used when filtering the countries in this study. The threshold for filtering the regions is 30; this value was chosen because the regions with more than 30 samples together account for more than 50% of the whole data. The number of regions above the threshold is 23, approximately 5% of all regions. Conversely, this means that 95% of the regions account for approximately 50% of the data, but these regions generally have very few samples each. The number of samples in each region can be examined in table 4.
4.5. Clusters
The data has been clustered using the mclust package in R, and the results have been loaded into Python via an Excel file. No number of clusters was specified when grouping the samples; the implementation produced 9 clusters. The number of samples in each cluster can be seen in table 11.
Table 11: Clusters created by mclust package
By dividing the values in the table above by the 'All' column, the proportion of each country falling into each cluster has been found. For instance, 44 divided by 136 for Bulgaria means that 32% of Bulgaria's samples are in cluster 4. In this way, usable information has been obtained for each cluster; as another example, cluster 3 has approximately 35% of Italy's samples and 24% of Spain's. Moreover, the groups have been visualized on a map of Europe by coloring a country red if more than 30% of its samples are in that cluster and orange if more than 10% are. The visualization of cluster 3 on the map can be seen in figure 16, and all clusters on the map can be seen in Appendix 1. The visualization of the clusters using a scatter plot can be seen in figure 4 on the right.
Figure 9: The visualization of cluster 3 on map
All clusters are as follow:
Cluster 1 WALES, ITALY, Spain
Cluster 2 Cyprus, Greece, Turkey
Cluster 3 ITALY, Spain
Cluster 4 ENGLAND,BULGARIA, Italy, Wales
Cluster 5 Greece, Turkey
Cluster 6 BULGARIA, Greece, Italy, Spain, Turkey
Cluster 7 CYPRUS, Bulgaria
Cluster 8
Cluster 9 GREECE
* Countries with more than 30% of their samples in a cluster are written in CAPITAL letters; countries with more than 10% of their samples are written in lowercase.
5. Discussions and Interpretations
Figures 7 and 8 show the accuracy rate and F1 score results for the models, respectively. As seen from the results, the four best-scoring models, namely KNN, bagging, random forests and adaboost, have performed better than the other methods. Even though support vector machines performed at an acceptable level with a 71% F1 score, classifying the data with logistic regression has not worked well: its accuracy rate is 56% and its F1 score is 48%.
5.1. Why does logistic regression fail?
The reason the logistic regression model produces poor results has been investigated, and two articles with the same explanation have been found: correlated inputs. One article states that "the model can overfit if you have multiple highly-correlated inputs."24 The data used in this study consists of ratios of 4 different lead isotope values to each other. As an example, the correlation between the 208Pb/206Pb and 206Pb/207Pb columns has been checked, and it is -0.9391. This correlation between the independent variables is also why the naïve Bayes classification algorithm was not used in this study in the first place, and understanding why logistic regression fails is one of the lessons learned during this research.
Although the accuracy rate and F1 score results are similar, there are still some small but important differences worth mentioning. First, the four best-scoring models have better F1 scores than accuracy rates, whereas the F1 score of logistic regression is lower than its accuracy rate. It can be noted that the F1 score was a successful performance measure for this study thanks to its known advantage of seeking a balance between precision and recall on datasets with uneven classes.
The table below on the left, used earlier in the results section, shows the error rankings for country pairs created from the confusion matrices. As mentioned earlier, Wales, Bulgaria and England are the countries with the smallest error rankings, which means they have the highest error rates. These countries have another common feature: they have fewer samples than the other countries. There are several approaches in the literature for dealing with such an unbalanced dataset.25 Two of them, downsampling and upsampling, have been tried during the study. However, the results obtained with downsampling and upsampling were worse for all the models than the results obtained by applying the methods as described in this article: the F1 scores for the ensemble methods applied to the downsampled and upsampled data were around 60-70%, at least 15% lower than the actual results. That is why an appropriate performance metric, the F1 score, was chosen as the way of handling the unbalanced classes. Even though the countries with fewer samples have more misclassified samples, the overall success rates were around 80% for the models applied to the original dataset, which is acceptable for this study.
A part of error ranking table on the left and sample count information for countries on the right
Another remarkable table from the results section shows each country with its 3 most erroneous pairs. This table makes it easier to interpret the misclassified samples with respect to the distances between countries. As seen from the table, for 6 of the 8 countries the most erroneous pairs are the countries closest to them: for example, Bulgaria's most erroneous pair is Greece, and Cyprus's first pair is Turkey. As mentioned earlier, Trincherini and Barbero give some meaningful reasons to be cautious when deciding the lead provenance of an archaeological artefact. The misclassified samples of these 6 countries can be related to two of those reasons: the way of marking ingots could have become uniform across different regions, and the operators of mines moved from one region to another. These explanations make more sense in light of these results.
There are only two exceptions to the "closest" pattern among the 8 countries: England and Wales, whose most erroneous pair is Italy. For these countries the explanation seems different, namely the loading of boats in one region and their movement to another. To interpret the misclassified samples further, another good approach is to examine the locations of the regions with the highest error rates.
The most erroneous regions, those with the highest error rates, can be seen in figure 14. The first thing that draws attention is the error rates themselves. The best methods, which are bagging, adaboost and random forests, have an F1 score of 86%, and the other methods score worse. By simple logic, many of the regions at the top of the list would be expected to have misclassification rates higher than 14%; however, only 2 regions do. This means that the models tend to misclassify samples from the regions with fewer samples.
The second thing that draws attention is the locations of the regions with the highest error rates. The locations of the top 8 regions in the list have been examined, and their common feature is being by the sea. Their locations are shown on the map in Appendix 2. This result can also be related to one of the explanations of Trincherini and Barbero, namely the loading of boats and their movement to another region with ingots. It can thus be stated that samples from regions by the sea are more likely to be misclassified by the models.
The last interpretation concerns the clusters. The countries in each cluster can be seen in the clusters section. The first observation is that whenever England or Wales is in a cluster, at least one of Italy or Spain is in it as well. Secondly, the clusters that do not include Wales or England contain countries that are geographically close to each other; clusters 2 and 3 are examples of this second type, with cluster 2 including Cyprus, Greece and Turkey and cluster 3 including Italy and Spain. Because of the closeness of the countries in the second type of cluster, they can also be related to the same reasons: the way of marking ingots could have become uniform across different regions, and the operators of mines moved from one region to another.
6. Conclusion and Possible Future Work
This study has focused on the use of machine learning algorithms on lead isotope ratio data to predict the provenance of ingots. Some of these algorithms belong to the supervised learning approach and classify samples; others are unsupervised learning models used to reveal similar groups. The classification methods used in this study can also be grouped into ensemble methods, non-ensemble methods and neural networks. For clustering the data, the k-means algorithm and the mclust package of R have been used.
The data had 5 lead isotope ratio columns plus region and country columns. Little work was needed to clean the data, since it had only 20 rows with missing values. Also, 8 of the 17 countries were chosen to work on in order to increase the efficiency of the models; this ensured that both the training and test datasets contain samples from each of the countries. The country column has been used as the target label for the classification algorithms, while the 5 lead isotope ratio columns have been used as independent variables. Moreover, the region and country information of the samples has been used in interpreting the results, together with the locations of the regions and the distances between the countries.
The results of this study fall into 5 pillars: performance measure results, error rates and their rankings for country pairs, countries with their most erroneous pairs, misclassification rates for the regions, and clusters. These results have been discussed and interpreted in light of the reasons specified in the article by Trincherini and Barbero, which are remarkable points of caution when deciding the lead provenance of an archaeological artefact, such as the movement of mine operators between regions.
In conclusion, this study was the first implementation of machine learning algorithms in the context of predicting the provenance of archaeological artefacts. Conventional classification algorithms have mainly been used and their performance measures observed. Although neural networks were applied as one of the methods, there is still room to apply them with different parameters, optimization functions and higher epoch numbers to increase their success rates.
References
[1] Trincherini, P., Barbero, P., Quarati, P., Domergue, C. and Long, L. (2001). Where Do the Lead
Ingots of the Saintes-maries-de-la-mer Wreck Come From? Archaeology Compared With Physics.
Archaeometry, 43(3), pp.393-406.
[2] Dodson, B. (2019). Archaeology vs. Physics: Conflicting roles for old lead. [online]
Newatlas.com. Available at: https://newatlas.com/relics-physics-archaeology-roman-lead/30032/
[Accessed 19 Aug. 2019].
[3] Bode, M., Hauptmann, A. and Mezger, K. (2009). Tracing Roman lead sources using lead
isotope analyses in conjunction with archaeological and epigraphic evidence—a case study from
Augustan/Tiberian Germania. Archaeological and Anthropological Sciences, 1(3), pp.177-194.
[4] Gomes, S., Araújo, M., Monge Soares, A., Pimenta, J. and Mendes, H. (2018). Lead provenance
of Late Roman Republican artefacts from Monte dos Castelinhos archaeological site (Portugal):
Insights from elemental and isotopic characterization by Q-ICPMS. Microchemical Journal, 141,
pp.337-345.
[5] En.wikipedia.org. (2019). Principal component analysis. [online] Available at:
https://en.wikipedia.org/wiki/Principal_component_analysis [Accessed 19 Aug. 2019].
[6] Scikit-learn.org. (2019). sklearn.decomposition.PCA — scikit-learn 0.21.3 documentation.
[online] Available at: https://scikit-
learn.org/stable/modules/generated/sklearn.decomposition.PCA.html [Accessed 19 Aug. 2019].
[7] En.wikipedia.org. (2019). Dependent and independent variables. [online] Available at:
https://en.wikipedia.org/wiki/Dependent_and_independent_variables [Accessed 19 Aug. 2019].
[8] Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of machine learning. 3rd
ed. p.7.
[9] Mitchell, T. (2017). Machine learning (1997). New York: McGraw Hill. pp. 55-58.
[10] Breiman, L., (1996) “Bagging predictors”. Machine Learning, 24:123-140.
[11] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical
Learning (2nd ed.). Springer. ISBN 0-387-95284-5.
[12] Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International
Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–
282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.
[13] En.wikipedia.org. (2019). AdaBoost. [online] Available at:
https://en.wikipedia.org/wiki/AdaBoost [Accessed 19 Aug. 2019].
[14] Hastie, T. (2003). Boosting. [online] Web.stanford.edu. Available at:
https://web.stanford.edu/~hastie/TALKS/boost.pdf [Accessed 19 Aug. 2019].
[15] En.wikipedia.org. (2019). Support-vector machine. [online] Available at:
https://en.wikipedia.org/wiki/Support-vector_machine#cite_note-ReferenceA-13 [Accessed 19
Aug. 2019].
[16] En.wikipedia.org. (2019). K-nearest neighbors algorithm. [online] Available at:
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm [Accessed 19 Aug. 2019].
[17] Scikit-learn.org. (2019). sklearn.linear_model.LogisticRegression — scikit-learn 0.21.3
documentation. [online] Available at: https://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html [Accessed 19
Aug. 2019].
[18] Yeh, C. (2019). Binary vs. Multi-Class Logistic Regression | Chris Yeh. [online]
Chrisyeh96.github.io. Available at: https://chrisyeh96.github.io/2018/06/11/logistic-
regression.html [Accessed 19 Aug. 2019].
[19] Garbade, M. (2018). Understanding K-means Clustering in Machine Learning. [online]
Medium. Available at: https://towardsdatascience.com/understanding-k-means-clustering-in-
machine-learning-6a6e67336aa1 [Accessed 19 Aug. 2019].
[20] Scrucca, L. (2019). A quick tour of mclust. [online] Cran.r-project.org. Available at:
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html [Accessed 19 Aug. 2019].
[21] Witten, I., Frank, E., Hall, M. “Data Mining: Practical Machine Learning Tools and
Techniques, 3rd Ed”.
[22] Cawley, Gavin C.; Talbot, Nicola L. C. (2010). "On Over-fitting in Model Selection and
Subsequent Selection Bias in Performance Evaluation" (PDF). 11. Journal of Machine Learning
Research: 2079–2107.
[23] Shung, K. (2019). Accuracy, Precision, Recall or F1?. [online] Medium. Available at:
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9 [Accessed 19 Aug.
2019].
[24] Brownlee, J. (2019). Logistic Regression for Machine Learning. [online] Machine Learning
Mastery. Available at: https://machinelearningmastery.com/logistic-regression-for-machine-
learning/ [Accessed 19 Aug. 2019].
[25] Boyle, T. (2019). Dealing with Imbalanced Data. [online] Medium. Available at:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
[Accessed 19 Aug. 2019].
Appendix
1- The visualization of the cluster on map
Clusters Clusters on Map
Cluster 1
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Cluster 6
Cluster 7
Cluster 9
2- The location of the regions with the highest error rates
Region Names Regions on Map
Gwynedd
Rhodope
Cornwall
Cumbria
Cyclades,
Siphnos
Peloponnese
Sardinia,
Iglesiente
Sardinia