Statistical Applications of Classification Algorithms Related to the Integration of Turkey
into the European Union
Author: Abdül Kuddüs Sünbül
16203822
The thesis is submitted to University College Dublin in part fulfillment of the requirements for
the degree of MSc Statistics
Supervisor: Dr. Riccardo Rastelli
August 2019
Table of Contents

List of Figures
List of Tables
Abstract
1. Introduction
1.1. Background and context
1.1.1. What is lead all about?
1.1.2. What are the lead isotope ratios and how is the data collected?
1.1.3. Reasons to be cautious
1.1.4. Relevant research questions and hypotheses
1.2. Motivation
1.2.1. Expected contributions to the area
1.2.2. Challenges
1.3. Research question
2. Exploratory Analysis and Comments
2.1. Data overview
2.2. Skewness
2.3. Histograms
2.4. The use of PCA and scatter plots
3. Theory and Methods
3.1. Understanding supervised and unsupervised learning
3.2. Statistical background and theoretical explanations of the methods
3.2.1. Decision Trees
3.2.2. Bagging
3.2.3. Random Forests
3.2.4. Boosting (AdaBoost)
3.2.5. Support vector machines (SVM)
3.2.6. K-nearest neighbors (KNN)
3.2.7. Logistic Regression
3.2.8. Neural Networks
3.2.9. K-means
3.2.10. The usage of Mclust package in R
3.3. Application of methods
3.3.1. K-fold cross validation and train-test split
3.3.2. Grid search
3.3.3. Application in Python and R
3.3.4. The use of mclust and why it was used against k-means
3.3.5. Application of neural networks
4.1. Performance measures
4.2. Error rates and their rankings for country pairs
4.3. Countries with their most erroneous pairs
4.4. Misclassification rates for the regions
4.5. Clusters
5.1. Why logistic regression fails?
6. Conclusion and Possible Future Work
References
Appendix
List of Figures

Figure 1: Samples of ingots
Figure 2: The histograms of columns
Figure 3: The histograms of columns for each country
Figure 4: Scatter plot with variables obtained using PCA
Figure 5: Pairwise scatter plot
Figure 6: The graphs show the results of K-means clustering and the Mclust package, from left to right
Figure 7: Accuracy rates comparison
Figure 8: F1 Score Comparison
List of Tables

Table 1: Sample Data
Table 2: Summary statistics of the data
Table 3: Country sample statistics
Table 4: Region sample statistics
Table 5: Confusion matrix of the KNN model
Table 6: A part of the pair error rates table
Table 7: A part of the error ranking table
Table 8: A sample of the data used to create the correlation coefficient
Table 9: Countries with their most erroneous pairs
Table 10: The misclassification rates table for the regions
Table 11: Clusters created by the mclust package
Abstract
This study focuses on the use of lead isotope ratios of sample artefacts in a machine learning context to support the integration of Turkey into the European Union. The importance of lead and its isotope ratios comes from their distinctive features. Especially in the last 20 years, lead isotope ratios have become prominent data for identifying the provenance of an ingot, since they are always contained in a greater or lesser quantity in ordinary metals. A shift in isotope composition from conversion processes such as weathering or smelting is not to be expected in these isotope systems, which makes the use of lead isotope ratios reliable in such studies. In addition, machine learning algorithms had not previously been applied to a lead isotope ratio dataset, which made this field a good choice for observing the success of machine learning methods. These applications and their results are useful for revealing the integration of Turkey and the European Union.

In this dissertation, a lead isotope ratio dataset consisting of 5 different isotope ratios together with country and region columns has been used in the implementation of the methods. The country column has been used as the target label for the classification algorithms. The classification algorithms used are decision trees, bagging, AdaBoost, random forests, logistic regression, k-nearest neighbors and support vector machines (SVM). A simple implementation of neural networks was a further classification approach. For the clustering part, the k-means algorithm and the mclust package in R have been used to obtain the groups of samples. The results of the classification methods have been compared with each other using several performance metrics, and misclassified samples have been used when interpreting the results.

The average success rate for the ensemble models was approximately 86%. While the SVM and decision tree algorithms perform well, the logistic regression model gives poor results. The results of the mclust application have been used in the interpretation of the misclassified samples.
1. Introduction
1.1. Background and context
The origin of an antiquity brought to light by archaeology is one of the hardest questions to resolve. The origin of metals, in particular in the form of ingots, has normally been investigated using the established methods of the study of antiquity. These methods, such as the study of the contents of a wreck and epigraphy, supply essential information. Over the past 20 years, the archaeological science known as archaeometry, which consists of the application of scientific techniques to the analysis of archaeological materials to assist in dating the materials and establishing their origin, has taken the lead in these studies. In order to analyze the origin of a wreck, the chemical composition of the ingots and the search for trace elements have gained importance. Lead isotope ratios have become prominent data for identifying the provenance of an ingot in these kinds of studies, as the ratios are always contained in a greater or lesser quantity in ordinary metals.1
1.1.1. What is lead all about?
Why lead is so important in archaeology has been explained by Dodson in a very simple and persuasive way: “Lead, offering as it does a
convenient combination of density and formability, is the first line of defense for radiation
shielding. However, newly smelted lead contains a radioactive lead isotope, Pb-210, which is
generated in the decay of U-238. While the uranium and other radioactive elements are largely
removed during the smelting process, the Pb-210 remains, producing a low-level radioactive decay
(about 200 decays per kilogram per second) that restricts the ability of the most sensitive nuclear
and particle physics experiments to function. Pb-210, however, has a 22.3 year half-life. When lead
bars have lain underwater for 2,000 years, all of the Pb-210 has decayed, leaving "Roman lead" (or
old lead) with a radioactive level roughly 100,000 times lower than is found in new lead.”2
1.1.2. What are the lead isotope ratios and how is the data collected?
The explanation of lead makes more sense with a fact stated by Bode, Hauptmann and Mezger. They say that “A shift in the isotope composition from conversion processes such as weathering or smelting is not to be expected in such isotope systems.”3 Lead isotope ratios also occur in different proportions in the mineral from which the lead is extracted; the naturally variable isotopic compositions that result make it possible to compare the isotopic compositions of the minerals with those of the manufactured products. This is the reason why ratios of isotopes are used: they give more sensitive indicators.1 The ratios generally used in the archaeometric context include 206Pb/207Pb and 208Pb/206Pb. In the light of these explanations, the lead isotope ratios of ingots can be used in the application of scientific techniques regardless of their age.
Figure 1: Samples of ingots
1.1.3. Reasons to be cautious
However, there are still some notable reasons to be cautious when deciding the lead provenance of an archaeological artefact. Trincherini and Barbero explain these reasons with an example in their
article. The article states that “two explanations are possible. One is global: commercial practices
(the shape of the ingots, and the way of marking them) could have become uniform from one region
(Hispania) to another (Gallia). The second is more particular: operators of Spanish mines might
have moved to the south of Gaul to operate mines there, taking their working habits with them. The
last argument to be debated is that of the loading of the boat itself--Gallic ingots below, and
Baetican amphorae over them. On this point, one may imagine that there was a trade of
redistribution from a port of the Languedoc coast--Narbonne, Agde or Lattes--in which the ingots
that were produced locally would normally have been loaded first, followed by the Baetican oil
amphorae, which had been brought there previously by another ship.” The reasons explained in this example make sense in general, since similar scenarios happened every day and the same can be the case in different situations. Therefore, it is expected that distinguishing some artefacts from each other using their lead isotope ratio values will not be easy despite their different origins.
1.1.4. Relevant research questions and hypotheses
When the use of lead isotope ratios in archaeology is searched in the literature, it can be seen that there are articles using lead isotope ratios as significant explanatory variables. Some of these articles aim to reveal how well lead isotope ratios can distinguish the provenance of archaeological artefacts. An article titled “Provenance evidence for Roman lead artefacts of distinct chronology from Portuguese archaeological sites”, published by Gomes, Araújo, Soares and Correia in 2017, is one example of using lead isotope ratios to distinguish the provenance of archaeological artefacts.

That study focuses on the comparison of variations in lead isotope ratios in materials found at Alto dos Cacos and Conimbriga. The intention is to investigate whether the source of leaden raw materials changed over time and, if so, to try to identify those different sources.4 Some other papers focus on clarifying the provenance of lead artefacts from military fortresses and camps located in certain regions. As is well known, lead was one of the significant metals used for military purposes throughout history. These studies use lead isotope ratio data for further statistical analysis, such as creating confidence intervals for regions or plotting the ratios on a coordinate plane to demonstrate differences.
1.2. Motivation
1.2.1. Expected contributions to the area
During a detailed literature review, although it proved possible to predict the origins of archaeological samples using classification algorithms, no previous use of any machine learning technique on lead isotope ratio data was found. That is why using sensitive lead isotope ratio indicators with different machine learning approaches was chosen as the main focus of this study. Here, the origins of samples have been predicted using lead isotope ratios as independent variables in classification algorithms. Some misclassified samples have been identified through the implementation of these methods. Moreover, a clustering approach has been used to reveal the groups among the artefacts. The results obtained by both the classification and the clustering algorithms have been examined by considering possible relations between them. When interpreting the results, the drawbacks stated by Trincherini and Barbero have been taken into consideration. The overall goal was to see the advantages and disadvantages of certain machine learning implementations on a dataset which had not been used for this purpose before.
1.2.2. Challenges
A challenge of this project is that the dataset contains only lead isotope values. Any prior information used by archaeologists to determine origins, such as the epigraphy of the artefacts, could not be used in this study. The success of this study also depends on the environment and technologies which were used for measuring the lead isotope values.
1.3. Research question
The perspective in this study is that if the provenances of ingots can be determined using lead isotope ratios in archaeology, it should also be possible to identify the origins of these samples using efficient machine learning methods. Therefore, the question answered here is whether some machine learning applications are useful and successful in determining the origins of sample artefacts from their isotope ratios. If the success rates of these algorithms are better than random guessing, the next step is to increase the success and accuracy rates of the implementations by using optimized hyper-parameters for the models. The results of the algorithms have been compared with each other using different performance measures and error rates. Also, by means of a clustering approach, possible connections between misclassified samples and regions have been revealed, and the results have been visualized on a map to make the findings easily understandable. Using these misclassified samples and clustering results, the possible reasons remarked on in other articles have been assessed.
2. Exploratory Analysis and Comments
2.1. Data overview
The data used in the study contains the isotopic ratios of lead in forms such as 208Pb/206Pb and 207Pb/204Pb. There is country and region information for each sample. 4363 different archaeological artefacts with 5 different lead isotope ratios and a country column as target label have been used to apply the different machine learning methods. The region column has been used when interpreting the results with respect to the misclassified samples of the classification algorithms. A sample of the dataset can be seen in Table 1.
Table 1: Sample Data

  Country  Region           208Pb/206Pb  206Pb/207Pb  206Pb/204Pb  207Pb/204Pb  208Pb/204Pb
0 Egypt    Timna            2.07167      1.065417     18.800       17.645680    38.947396
1 Eire     NaN              2.10380      1.094451     18.178       16.609239    38.242876
2 Eire     Galway           2.14261      1.117743     17.253       15.435569    36.966450
3 Egypt    Gabal El Ineigi  2.11090      1.139134     17.630       15.476672    37.215167
4 Italy    Sardinia         2.12330      1.141279     17.812       15.607053    37.820220
These samples come from 17 countries and 460 regions. As seen from the table above, there are some NaNs in the dataset. The first step was to examine the number of NaNs and decide how to handle them. 3 of the 5 lead isotope ratio columns have 20 missing values, and these missing values fall in the same rows. The 20 rows with missing values correspond to approximately 0.5% of the whole dataset. If the rows with missing values are dropped, no important information that can be obtained from the dataset will be lost. Therefore, they were simply excluded from the dataset.
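The missing-value check and exclusion described above can be sketched with pandas. A tiny illustrative frame with the same missing-value pattern is used here, since the full dataset is not reproduced:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one row has NaNs in the isotope ratio columns.
df = pd.DataFrame({
    "Country": ["Egypt", "Eire", "Eire"],
    "208Pb/206Pb": [2.07167, np.nan, 2.14261],
    "206Pb/207Pb": [1.065417, np.nan, 1.117743],
})

print(df.isna().sum())  # count of NaNs per column

# Exclude rows with missing isotope values, as done in the study.
clean = df.dropna(subset=["208Pb/206Pb", "206Pb/207Pb"])
print(len(clean))       # rows remaining after exclusion
```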
The new dataset has 4343 samples from 16 countries and 453 regions. Portugal had only 1 sample, and it contained missing values, so the country was excluded from the dataset completely. Table 2 shows the summary statistics of the data.
Table 2: Summary statistics of the data

       208Pb/206Pb  206Pb/207Pb  206Pb/204Pb  207Pb/204Pb  208Pb/204Pb
count  4343         4343         4343         4343         4343
mean   2.081477     1.185914     18.569244    15.658071    38.644838
std    0.022437     0.019471     0.332109     0.071317     0.357808
min    1.635600     1.065417     17.253000    14.995782    36.966450
25%    2.069010     1.168681     18.290000    15.628000    38.367965
50%    2.078440     1.191909     18.622000    15.658252    38.666448
75%    2.098257     1.198804     18.807000    15.689925    38.943319
max    2.142610     1.465846     23.690000    17.645680    41.061193
204Pb values are relatively smaller than the other lead isotope values, so the ratios with 204Pb in the denominator have larger means. Mean and median values are close to each other for all columns, which suggests that the data does not contain many outliers. Also, especially for the 206Pb/207Pb column, the minimum and maximum values are not close to the adjacent quartile values; this issue has been assessed by means of other plots created in the next steps.
The next step was to examine the number of samples for each country and region in order to determine how, and which, machine learning algorithms should be constructed for the data. As seen from Table 3, 8 countries, namely Greece, Italy, Spain, Cyprus, Turkey, England, Wales and Bulgaria, form 97.4% of all samples. Only these 8 countries have been used in the analyses. The reason for choosing these countries is that there is a breaking point in the counts after the 8th country, Bulgaria, from 3.1% to 1.1%. Thus, every country will surely be represented in each cross-validated training set and in the test set.
Table 3: Country sample statistics

           Count  Percentage  Cumulative P.
Greece     1338   30.8%       30.8%
Italy      770    17.7%       48.5%
Spain      728    16.8%       65.3%
Cyprus     490    11.3%       76.6%
Turkey     361    8.3%        84.9%
England    261    6.0%        90.9%
Wales      146    3.4%        94.3%
Bulgaria   136    3.1%        97.4%
Eire       47     1.1%        98.5%
Egypt      31     0.7%        99.2%
Scotland   23     0.5%        99.7%
Syria      6      0.1%        99.9%
France     3      0.1%        99.9%
Algeria    1      0.0%        100.0%
Palestine  1      0.0%        100.0%
As seen from Table 4, only 23 regions, approximately 5% of all regions, account for more than 50% of all samples. There are many regions with only a few samples, so applying classification algorithms with the region column as target label would not be appropriate. That is why the country column, restricted to the 8 countries above, was selected as the target label in this study.
Table 4: Region sample statistics

                       Count  Percentage  Cumulative P.
Sardinia               468    10.9%       10.9%
Cyclades               304    7.1%        17.9%
Sierra Morena          185    4.3%        22.2%
Attica                 142    3.3%        25.5%
Sud-Est                117    2.7%        28.3%
Province of Huelva     98     2.3%        30.5%
Larnaca                97     2.3%        32.8%
Province of Almeria    96     2.2%        35.0%
Province of Sevilla    80     1.9%        36.9%
Solea                  63     1.5%        38.3%
Burgas district        48     1.1%        39.5%
Sardinia, Iglesiente   48     1.1%        40.6%
Limni                  47     1.1%        41.7%
Crete                  45     1.0%        42.7%
Cornwall               44     1.0%        43.7%
Gwynedd                43     1.0%        44.7%
Peloponnese            40     0.9%        45.7%
Dyfed (Cardiganshire)  39     0.9%        46.6%
Cumbria                37     0.9%        47.4%
Seriphos, Moutoulos    37     0.9%        48.3%
Cyclades, Siphnos      36     0.8%        49.1%
Rhodope                34     0.8%        49.9%
Othrys Mountains       31     0.7%        50.6%
2.2. Skewness
Before further analysis, the question of whether the data has to be log-transformed needs to be answered, since heavily skewed data is not suitable for use in the classification methods. In order to check the skewness, histograms of all columns of the original and log-transformed data have been created. The results can be seen in Figure 2. As seen, there are only a few outliers, and there is no need to treat them specially.
Figure 2: The histograms of columns
2.3. Histograms
As seen, the histograms on both sides have similar shapes. This means there is no need to log-transform the data, since the data is not skewed; therefore, the original dataset will be used for further analyses. The important point that needs to be handled is that the histograms have multiple peaks. The reason for more than one peak could be differences between the lead isotope ratios of different countries, or differentiation between the values of regions within some countries. In order to investigate this, histograms of every column for each country have been created; they can be seen in Figure 3.
Figure 3: The histograms of columns for each country
It is observed that there are some countries, such as Greece and England, whose samples are similar and generally produce one peak. On the other hand, there are some countries, especially Spain, with more than one peak. When the Spain samples are clustered into two groups using the k-means algorithm, the results show that the samples separate into east and west. Huelva, Seville and Cadiz are examples of the west cluster, while Almeria and Murcia are examples of the east cluster. This means that the map boundaries of today's world may not overlap with historical facts. This situation has been taken into consideration while examining the results of the methods.
2.4. The use of PCA and scatter plots
Another approach for exploratory analysis was to display the samples of each country on a scatter plot. However, scatter plots need only two columns explaining the data. That is why Principal Component Analysis (PCA) has been used to decrease the number of columns from 5 to 2. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) form an uncorrelated orthogonal basis set.5
For this purpose, I have used the PCA function of the scikit-learn library in Python, which reduces dimensionality using a singular value decomposition of the data to project it to a lower-dimensional space.6 The two components obtained as a result explain 99.2% of the variance in the data. Therefore, a scatter plot of these values is reliable for forming an opinion about how well the countries are distinguished from each other. Figure 4 shows that the countries are not distinctly separated from each other, so it is expected that the dataset contains groups with samples from different countries.
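The reduction described above can be sketched as follows. The data here is a synthetic stand-in for the five isotope ratio columns, so the variance figures will differ from the 99.2% reported for the real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the five isotope ratio columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make two columns correlated

# Project the five columns down to the two leading principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

print(scores.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```

The two columns of `scores` are the coordinates used for the scatter plot, and `explained_variance_ratio_` is how the 99.2% figure for the real data is obtained.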
Figure 4: Scatter plot with variables obtained using PCA
When the dataset is examined with the pairwise scatter plots in Figure 5, a similar result to the PCA plot is observed.
Figure 5: Pairwise scatter plot
3. Theory and Methods
The purpose of this study is to observe the success and usability of some statistical machine learning algorithms on the lead isotope ratio data of archaeological artefacts. These artefacts come from 8 different countries and many more regions. In statistical modeling, the values of dependent variables depend on the values of independent variables. The dependent variables represent the output or outcome whose variation is being studied, while the independent variables represent the inputs, that is, the potential causes of that variation.7 In this case, the independent variables are the lead isotope ratios and the dependent variable is the country information of the samples.
3.1. Understanding supervised and unsupervised learning
There are several common learning scenarios in machine learning. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm. Two of these scenarios are applicable to this study: supervised and unsupervised learning. In supervised learning, the learner receives a set of labeled examples (here, lead isotope ratios with the country column as label) as training data and makes predictions for all unseen points. This is the most common scenario associated with classification problems, and it is the one used in this study. In unsupervised learning, the learner exclusively receives unlabeled training data and makes predictions for all unseen points.8 In this study, although the data have labels, the labels are ignored when applying the clustering methods. The purpose is to obtain groups regardless of the countries or regions in the first place; the relation of the error rates for each country pair to these clusters is then interpreted.
3.2. Statistical background and theoretical explanations of the methods
In the area of classification there are several algorithm families: linear classifiers, with logistic regression and the support vector machine as examples; boosting algorithms; decision trees and random forests, which use decision trees as the base algorithm; and neural networks. Among many classification algorithms, 7 of them plus neural networks have been used in this study. These 7 algorithms are decision trees, random forests, bagging, boosting (AdaBoost), support vector machines, k-nearest neighbors and logistic regression. The reason for choosing these algorithms is to apply a wide range of classification algorithms compatible with the data. Also, two clustering methods have been used: k-means clustering and finite Gaussian mixture modelling via the mclust package in R. In the next part, these models are explained briefly together with their statistical background.
3.2.1. Decision Trees
Decision tree learning begins with the question "which attribute should be tested at the root of the tree?" To answer this question, each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples. This statistical test, in other words the information gain, measures how well a given attribute separates the training examples according to their target classification. Entropy is one way to characterize the (im)purity of an arbitrary collection of examples for information gain. For a set of examples S with class labels, where pi is the relative frequency (probability) of class i, it is formulated as follows:

Entropy(S) = − Σi pi log2(pi)
The best attribute is selected and used as the test at the root node of the tree. A descendant of the
root node is then created for each possible value of this attribute, and the training examples are
sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's
value for this attribute). The entire process is then repeated using the training examples associated
with each descendant node to select the best attribute to test at that point in the tree.9
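The entropy formula above can be written directly in Python as a quick check (a small helper, not part of the study's own code):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels: -sum_i p_i * log2(p_i)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    # "+ 0.0" avoids returning -0.0 for a pure (single-class) set.
    return float(-(p * np.log2(p)).sum()) + 0.0

print(entropy(["a", "a", "b", "b"]))  # 1.0: maximally impure two-class set
print(entropy(["a", "a", "a", "a"]))  # 0.0: a pure set
```

A pure node has entropy 0 and an evenly split two-class node has entropy 1, which is exactly the (im)purity scale used when choosing the root attribute.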
3.2.2. Bagging
Bagging is short for bootstrap aggregating and is an ensemble learning method. It is a general idea that can be used with any predictive method and is very often used for classification.

Bagging works as follows:
1. Take a bootstrap sample of your data. Recall that approximately 0.632N (where N is the total number of samples) unique observations will be in the bootstrap sample.
2. Fit a classifier to the bootstrap sample. (In this study, the classifier is a decision tree.)
3. Use the classifier on the non-bootstrap samples to assess the accuracy. This is called the Out Of Bag prediction.
4. Repeat M times.

To classify a new sample, all of the classification trees are used and the majority class produces the final decision.10 So, bagging can be seen as an upgraded version of decision trees.
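Steps 1–4 can be sketched with scikit-learn's BaggingClassifier on synthetic data (the default base learner is a decision tree, as in this study):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in data with a simple class boundary.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# n_estimators is M in the steps above; oob_score=True assesses accuracy
# on the non-bootstrap (out-of-bag) samples, as in step 3.
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=2)
bag.fit(X, y)
print(bag.oob_score_)  # out-of-bag accuracy estimate
```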
3.2.3. Random Forests
Random forests are also an ensemble learning method for classification trees (decision trees): they operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees.11 Random forests correct for decision trees' habit of overfitting to their training set.12 Random forests create different datasets by sampling n training examples from the original dataset with replacement and train a decision tree on each newly created dataset. Up to this point random forests work in the same way as bagging. However, at each split of each tree, random forests additionally consider only a random subset of the independent variables. Again, after training, predictions for unseen samples are made by taking the majority vote, so random forests can be seen as an upgraded version of bagging.

The use of a different random subset of the variables at each split of the classification tree makes the trees even more diverse than in the bagging algorithm.
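The random subset of variables at each split, which is what distinguishes the forest from plain bagging, corresponds to the max_features argument in scikit-learn. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# max_features=2: only a random subset of 2 of the 5 variables is
# considered at each split, making the trees more diverse than in bagging.
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=3)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy
```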
3.2.4. Boosting (Adaboost)
The idea in boosting is that the outputs of other learning algorithms, the weak learners (decision trees in this study), are combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. That is why AdaBoost can be sensitive to noisy data and outliers. As with all ensemble methods, the individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm, so that later trees tend to focus on harder-to-classify examples.13 In this respect, AdaBoost is completely different from the other ensemble methods used in this study, bagging and random forests. AdaBoost works algorithmically as follows14:
1. Initialize the observation weights: wi = 1/N, i = 1, 2, ..., N
2. For m = 1 to M repeat steps (a)–(d):
   a. Fit a classifier Gm(x) to the training data using weights wi
   b. Compute errm = Σi wi I(yi ≠ Gm(xi)) / Σi wi
   c. Compute αm = log((1 − errm)/errm)
   d. Update the weights for i = 1, ..., N: wi ← wi · exp(αm · I(yi ≠ Gm(xi))), and renormalize the wi to sum to 1
3. Output G(x) = sign(Σm αm Gm(x))
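A minimal sketch of this procedure via scikit-learn's AdaBoostClassifier, on synthetic data (the default weak learner is a depth-1 decision tree, playing the role of Gm(x) above):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (X[:, 1] > 0).astype(int)

# n_estimators is M in the algorithm above; each round reweights the
# training samples in favor of those the previous stumps got wrong.
boost = AdaBoostClassifier(n_estimators=50, random_state=4)
boost.fit(X, y)
print(boost.score(X, y))  # training accuracy
```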
3.2.5. Support vector machines (SVM)
Support vector machines are a mathematically more complex method than the other methods used in this study; they are therefore explained here in the simplest possible terms, without going into the full model. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. The classes are separated by hyperplanes which represent the largest separation, or margin, between the two or more classes.15

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.11 In this study both types of SVM have been applied via the GridSearch process of the scikit-learn library in Python.
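A sketch of such a grid search over linear and kernelized SVMs, on synthetic data with a deliberately non-linear class boundary (the parameter grid here is illustrative, not the one used in the study):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data with a circular (non-linear) class boundary.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)

# Try both a linear and a kernelized (rbf) SVM over a small grid of C values.
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # the best kernel/C combination found
```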
3.2.6. K-nearest neighbors (KNN)
The k-nearest neighbors algorithm is one of the easiest algorithms to understand, and the idea is very simple. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer).

A useful technique to increase the success of k-NN can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.16
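The plurality vote and the 1/d weighting scheme map directly onto scikit-learn (synthetic data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# weights="distance" applies the 1/d weighting described above,
# so nearer neighbors count for more in the vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X, y)
print(knn.score(X, y))  # 1.0 on the training set: each point is its own zero-distance neighbor
```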
3.2.7. Logistic Regression
In the multiclass case, the training algorithm in the scikit-learn library of Python needs to use the one-vs-rest (OvR) scheme.17 In one-vs-rest, C separate binary classification models are trained. Each classifier fc, for c ∈ {1, ..., C}, is trained to determine whether or not an example is part of class c. To predict the class for a new example x, all C classifiers are run on x and the class with the highest score is chosen: y = arg maxc ∈ {1,...,C} fc(x).

One main drawback is that when there are many classes, each binary classifier sees a highly imbalanced dataset, which may degrade performance.18
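The OvR scheme can be made explicit with scikit-learn's OneVsRestClassifier wrapper (synthetic three-class data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data: the label is whichever of the first
# three features is largest.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = np.argmax(X[:, :3], axis=1)

# One binary logistic model per class; prediction takes the
# highest-scoring class, as described above.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_))  # 3: one fitted binary classifier per class
```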
3.2.8. Neural Networks
Neural networks are actually too large a topic to explain in a paragraph. They are not just another machine learning method like the others used in this study; their applications are known as deep learning in the literature. Therefore, they are only defined here together with the inspiration behind them. Artificial Neural Networks (ANNs), the type of neural network used in this study, are inspired by biological nervous systems, such as the brain, where large numbers of interconnected neurons work together to solve a problem. A neuron receives signals from other neurons through connections called synapses. The combination of these signals, in excess of a certain activation level, will result in the neuron “firing”, i.e. sending a signal on to other neurons connected to it. As a result of long chains of computational stages, multi-class predictions are obtained with the help of an activation function at the end.
3.2.9. K-means
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. To process the training data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster. Iterative calculations are then performed to optimize the positions of the centroids. The algorithm stops creating and optimizing clusters when either:
- The centroids have stabilized: there is no change in their values because the clustering has been successful.
- The defined number of iterations has been reached.19
Moreover, the clusters are assumed to be spherical and evenly sized, which can make k-means unreliable as a clustering method in some studies.
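The loop and the two stopping conditions above can be sketched in a few lines of plain Python. The one-dimensional toy points and the fixed starting centroids below are illustrative only; real use would call e.g. scikit-learn's KMeans, which also randomizes the initial centroids.

```python
# A stdlib-only sketch of the k-means loop described above (1-D points for brevity).
def kmeans(points, k, centroids, max_iter=100):
    for _ in range(max_iter):          # stop condition 2: iteration limit
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:           # stop condition 1: centroids stabilised
            break
        centroids = new
    return centroids, clusters

cents, groups = kmeans([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], k=2, centroids=[0.0, 6.0])
print(cents)
```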
3.2.10. The usage of Mclust package in R
Mclust is a contributed R package for model-based clustering, classification, and density estimation
based on finite normal mixture modelling. It provides functions for parameter estimation via the
EM algorithm for normal mixture models with a variety of covariance structures, and functions for
simulation from these models. Also included are functions that combine model-based hierarchical
clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in
comprehensive strategies for clustering, density estimation and discriminant analysis. Additional
functionalities are available for displaying and visualizing fitted models along with clustering,
classification, and density estimation results.20
This package has been used to obtain more reliable clusters for the data without running into the same issues as the k-means algorithm.
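The thesis's clustering itself was done in R with mclust. Purely as an illustration, a rough Python analogue of mclust's core idea, Gaussian mixtures fitted by EM with the number of components chosen by BIC, can be sketched with scikit-learn's GaussianMixture; the synthetic blobs below are stand-ins, not the isotope data.

```python
# A rough Python analogue of mclust's core idea (EM-fitted Gaussian mixtures,
# model chosen by BIC); the thesis used the mclust R package, not this code.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for the isotope-ratio data.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

# Fit mixtures with 1..5 components and keep the one with the lowest BIC,
# mirroring mclust's BIC-based choice of the number of clusters.
fits = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)]
best = min(fits, key=lambda m: m.bic(X))
print(best.n_components)
```

Unlike k-means, each component carries its own covariance matrix, so the fitted clusters need not be spherical or evenly sized.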
3.3. Application of methods
The methods used in this study have mainly been applied in Python; the only application in R was the use of the mclust package for clustering the samples. In supervised learning, evaluating the performance of the methods is also an important part of the study. For the unsupervised learning methods, no predictions are made, so evaluating the models with performance measures does not apply.
Before applying the machine learning methods, feature scaling was applied to the dataset. Feature scaling is a method used to normalize the range of the independent variables or features of the data; it is also known in the literature as data normalization and is generally performed during the data preprocessing step. The objective functions of some machine learning algorithms do not work properly without normalization, since many classifiers compute the distance between two points as the Euclidean distance. By normalizing the values, each feature contributes approximately proportionately to the final distance. Another benefit of feature scaling is that gradient descent converges much faster with it than without it.
3.3.1. K-fold cross validation and train-test split
For classification problems, it is natural to measure a classifier’s performance in terms of the error
rate. The classifier predicts the class of each instance: if it is correct, that is counted as a success;
if not, it is an error. The error rate is just the proportion of errors made over a whole set of instances,
and it measures the overall performance of the classifier.21 For evaluating performance, the first step is to set up a train-test split for the dataset. The train and test sets have been constructed using a 70-30% split in this study; thus, the training set has 2961 samples and the test set has 1269.
Another technique used while implementing the classification algorithms is k-fold cross validation.
The goal of k-fold cross validation is to test the model's ability to predict new data that was not
used in estimating it, in order to flag problems like overfitting or selection bias22 and to give insight into how the model will generalize to an independent dataset. The general procedure for k-fold cross validation is as follows:
1. Split the dataset into k groups
2. For each unique group:
o Take the group as a hold out or test data set
o Take the remaining groups as a training data set
o Fit a model on the training set and evaluate it on the test set
Multiple rounds of cross-validation are performed and the validation results are averaged over the
rounds to give an estimate of the model's predictive performance.
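The procedure above maps directly onto scikit-learn's KFold. The toy one-feature data below is illustrative only, not the thesis dataset.

```python
# A sketch of the k-fold procedure listed above, using scikit-learn's KFold.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[i] for i in range(20)], dtype=float)
y = np.array([0] * 10 + [1] * 10)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Each group is held out once as the test set; the rest trains the model.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(sum(scores) / len(scores))   # averaged estimate of predictive performance
```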
3.3.2. Grid search
Grid search is an approach also known as hyperparameter optimization. A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from the data; such values have to be set before the learning process begins. Examples include k in k-nearest neighbors and the number of hidden layers in neural networks.
GridSearchCV function of scikit-learn library is useful for optimizing the parameters of the
estimator together with cross-validation.
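As a sketch of the GridSearchCV usage, the example below tunes k for a k-NN classifier on toy data; the thesis tuned each model's own grid in the same way, but the data and grid here are illustrative assumptions.

```python
# A sketch of hyperparameter search with GridSearchCV, here tuning k for k-NN.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[i] for i in range(30)], dtype=float)
y = np.array([0] * 15 + [1] * 15)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7]},
                      cv=5)               # every setting is scored by cross-validation
search.fit(X, y)
print(search.best_params_)
```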
3.3.3. Application in Python and R
All classification methods except neural networks have been applied using scikit-learn, a machine learning library for Python. Neural networks have been applied using the Keras library, which runs on top of TensorFlow. After applying the classification methods, cross tabulations and confusion
matrices have been created using the best hyperparameters for each model. Using these matrices,
the results can easily be interpreted for each class.
3.3.4. The use of mclust and why it was used against k-means
To create clusters with the mclust package, the MclustDR function was used. MclustDR aims at reducing the dimensionality by identifying a set of linear combinations of the original features, ordered by importance as quantified by the associated eigenvalues, which capture most of the clustering or classification structure contained in the data.
As mentioned earlier, k-means clustering has some disadvantages: the clusters it creates are assumed to be spherical and evenly sized. To visualize what is meant by spherical and evenly sized, I have clustered the original data into 7 groups using k-means. As seen on the left of figure 6, the k-means algorithm groups samples that are close to each other, whereas the mclust package has created the clusters in a different manner.
Figure 6: The graphs show the results of K-means clustering and Mclust package from left to right
3.3.5. Application of neural networks
Neural networks have been used only briefly in this study. Using the GridSearchCV function, different hyperparameters were tried at once, such as the adam and stochastic gradient descent optimizers. Softmax was used as the output activation function, since the aim of the study is to predict multiple classes in the dataset. The model structure had one hidden layer with 200 nodes. The epoch number, which defines the number of times the learning algorithm works through the entire training dataset, was set to 1000. The batch size, a hyperparameter defining the number of samples to work through before updating the internal model parameters, was set to 32. The reason other values for the epochs, hidden units and batch size were not tried is that the run time of this model was about half an hour: it starts at an approximately 45% success rate and ends at 72-73%. This means that the time required to run even simple neural network models is much longer than for the other machine learning methods.
I think the success rate of neural networks on this dataset could be increased by using larger epoch numbers and different hyperparameters. Since this is the first article applying machine learning methods to lead isotope ratios, classical methods such as KNN and random forests were preferable in the first place; therefore, it was decided to frame this study around conventional machine learning methods.
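The thesis trained this architecture in Keras; purely as an illustration of the shape described above (one hidden layer, softmax output over 8 countries), the forward pass can be sketched in numpy. The random weights below are stand-ins, not a trained model.

```python
# A numpy sketch of the forward pass of the network shape described above;
# the thesis used Keras, and the weights here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 5, 200, 8   # 5 isotope ratios, 8 countries

W1, b1 = rng.normal(0, 0.1, (n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, n_classes)), np.zeros(n_classes)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)          # hidden layer with ReLU activation
    z = h @ W2 + b2
    e = np.exp(z - z.max())                 # softmax over the 8 class scores
    return e / e.sum()

p = forward(rng.normal(size=n_features))
print(p.argmax(), p.sum())                  # predicted class and total probability
```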
4. Results
4.1. Performance measures
Choosing an appropriate performance measure is one of the most essential parts of a machine learning study. The first metric that comes to mind is classification accuracy, but the accuracy rate alone is not enough to truly judge the results. Precision and recall are other important ways to evaluate the results, and the F1 score, a combination of precision and recall, is also beneficial; it has been used together with the accuracy rate to evaluate and compare the models in this study.
Accuracy is the number of correctly predicted data points out of all the data points. More formally,
it is defined as the number of true positives and true negatives divided by the number of true
positives, true negatives, false positives, and false negatives. A true positive or true negative is a
data point that the algorithm correctly classified as true or false, respectively. A false positive or
false negative, on the other hand, is a data point that the algorithm incorrectly classified. The
comparison of the classification algorithm results used in this study in terms of accuracy rate can
be seen in figure 7.
Figure 7: Accuracy rates comparison
The F1 score is the harmonic mean of precision and recall. Its range is [0, 1], where 1 is perfect classification and 0 is total failure. It tells you how precise your classifier is, as well as how robust it is. To understand the F1 score, precision and recall need to be known. Precision is the number of correct positive results divided by the number of positive results predicted by the classifier; it answers the question "What proportion of positive identifications was actually correct?" Recall is the number of correct positive results divided by the number of all relevant samples; it answers the question "What proportion of actual positives was identified correctly?" The F1 score is the better measure to use in this study, since there is an uneven class distribution and a balance between precision and recall is needed:23 the dataset has 30.8% of its samples from Greece and only 3.1% from Bulgaria. The formulas for precision, recall and the F1 score are as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
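The three formulas can be computed directly from the true/false positive and negative counts; the counts below are illustrative, not taken from the thesis results.

```python
# A tiny sketch computing the three measures from TP/FP/FN counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```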
The comparison of the classification algorithms’ results used in this study in terms of F1 score can
be seen in figure 8.
Figure 8: F1 Score Comparison
Beyond the performance measures, four further major results have been produced using the dataset and the models. The first uses the confusion matrices of the classification algorithms to create an error-ranking table for country pairs. The second focuses on which countries cause the most error on average and determines the 3 most erroneous pairs for each country; it also examines the results with respect to the distances between the countries' capitals. The third concentrates on the regions by examining the misclassified samples of the models. The last result relates to the mclust package and its clusters.
4.2. Error rates and their rankings for country pairs
The main approach in this part was to compute error rates from the models' confusion matrices. The error rate for each country has been calculated row by row. To make the strategy more understandable, a sample confusion matrix is shown in table 5 and used for a detailed explanation.
Table 5: Confusion matrix of KNN model
The values of the matrix in table 5 have been divided by the 'All' column to get the error rate for each pair; the diagonal values have not been used, since they are not part of the error rates. As an example, 4 of the 51 Bulgaria samples have been predicted as Cyprus by the KNN model, so the error rate for the Bulgaria-Cyprus pair is 0.078. Likewise, error rates have been computed from all confusion matrices except that of the logistic regression model in order to create the final error rates table. The logistic regression (LR) model has been excluded because its predictions contain only 6 countries: almost 50% of all samples have been predicted as Greece and 30% as Italy. Moreover, the F1 score and accuracy rate of the LR model are clearly worse than the other models' rates. From the confusion matrices of the 6 models, the final error rates table has been created; a sample of it can be seen in table 6.
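The row-normalisation step described above can be sketched with pandas. The tiny matrix below is made up, apart from the Bulgaria row, which reproduces the 4-of-51 example from the text.

```python
# A sketch of how the pair error rates were read off a confusion matrix:
# each off-diagonal count divided by the row total (the 'All' column).
import pandas as pd

cm = pd.DataFrame(
    {"Bulgaria": [45, 3], "Cyprus": [4, 90], "Greece": [2, 7]},
    index=["Bulgaria", "Cyprus"],     # true labels on rows, predictions on columns
)
rates = cm.div(cm.sum(axis=1), axis=0)          # divide each row by its total
pairs = {(t, p): rates.loc[t, p]                # keep only off-diagonal pairs
         for t in rates.index for p in rates.columns if t != p}
print(round(pairs[("Bulgaria", "Cyprus")], 3))
```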
Table 6: A part of pair error rates table
In the next step, these error rates have been ranked for each model, and from these rankings a final average error ranking has been generated. In the table below, a smaller ranking for a pair means a higher error rate for that pair. A sample of the ranking table can be seen in table 7.
Table 7: A part of error ranking table
4.3. Countries with their most erroneous pairs
In this part, the first step was to obtain the coordinates of the countries' capitals and compute the distances between them. The capital coordinates were found on the web and loaded into Python via an Excel file. With 8 countries, the error ranking table contains 56 different pairs. The simplest yet meaningful result that could be produced from the capital distances and the pair error rankings was a correlation coefficient: the correlation between these two columns is 0.43. A sample of the data used to calculate the correlation coefficient can be seen in table 8.
Table 8: A sample of the data used to compute the correlation coefficient
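The Pearson correlation between capital distances and error rankings can be sketched with the standard library; the distance and ranking values below are hypothetical placeholders, not the values behind the 0.43 figure.

```python
# A stdlib sketch of the Pearson correlation computed between capital
# distances and pair error rankings; the values below are made up.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

distances = [300, 800, 1500, 2200]      # hypothetical capital distances (km)
rankings = [5, 12, 30, 41]              # hypothetical error rankings
print(round(pearson(distances, rankings), 2))
```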
The result indicates that errors in the predictions are more likely when the countries are geographically closer to each other. The 15 most erroneous pairs in the error rankings table have also been examined to find out which countries' samples were most often predicted wrongly: Bulgaria and Wales each appear in the list 4 times, and England 3 times.
The final table shows, for each country, the first three countries causing errors in its predictions; it can be seen in table 9. For example, the samples of England have most often been wrongly predicted as Italy, Greece and Bulgaria, respectively.
Table 9: Countries with their most erroneous pairs
4.4. Misclassification rates for the regions
The same technique used to get the error rankings for country pairs has also been applied to get the misclassification rates for the regions. First, the number of misclassified samples for every classification method except LR has been calculated region by region. These numbers have been divided by the total number of samples in each region, giving the proportion of misclassified samples per region for all the methods. Next, the average of these rates has been calculated and combined with both the countries of the regions and the number of samples per region to produce an interpretable table. A sample of the generated table can be seen in table 10.
Table 10: The misclassification rates table for the regions
The mean column in table 10 shows the misclassification rates for the regions. The regions have also been filtered according to their total number of samples, because a region with fewer samples than a certain threshold is very likely to be misclassified. This approach is similar to the one used when filtering the countries in this study. The threshold for filtering the regions is 30; this value was chosen because the regions with more than 30 samples together account for more than 50% of the whole data. The number of regions above the threshold is 23, approximately 5% of all regions. Conversely, this means that 95% of the regions account for approximately 50% of the data, but these regions generally have very few samples each. The number of samples in each region can be examined in table 4.
4.5. Clusters
The data has been clustered using the mclust package in R, and the results have been loaded into Python via an Excel file. No number of clusters was specified when grouping the samples; the implementation produced 9 clusters. The number of samples in each cluster can be seen in table 11.
Table 11: Clusters created by mclust package
By dividing the values in the table above by the 'All' column, the proportion of each country falling into each cluster has been found. For instance, 44 divided by 136 for Bulgaria means that 32% of Bulgaria's samples are in cluster 4. In this way, usable information has been obtained for each cluster; as another example, cluster 3 has approximately 35% of Italy's samples and 24% of Spain's. Moreover, the groups have been visualized on a map of Europe by coloring a country red if more than 30% of its samples are in that cluster and orange if more than 10% are. The visualization of cluster 3 on the map can be seen in figure 16, and all clusters on the map can be seen in Appendix 1. The visualization of the clusters using a scatter plot can be seen in figure 4 on the right.
Figure 9: The visualization of cluster 3 on map
All clusters are as follow:
Cluster 1 WALES, ITALY, Spain
Cluster 2 Cyprus, Greece, Turkey
Cluster 3 ITALY, Spain
Cluster 4 ENGLAND,BULGARIA, Italy, Wales
Cluster 5 Greece, Turkey
Cluster 6 BULGARIA, Greece, Italy, Spain, Turkey
Cluster 7 CYPRUS, Bulgaria
Cluster 8
Cluster 9 GREECE
* Countries with more than 30% of their samples in a cluster are written in CAPITAL letters; countries with more than 10% of their samples are written in lowercase.
5. Discussions and Interpretations
Figures 7 and 8 show the accuracy rate and F1 score results for the models, respectively. As seen from the results, the four best-scoring models, namely KNN, bagging, random forests and adaboost, have performed better than the other methods. Even though support vector machines performed at an acceptable level with a 71% F1 score, classifying the data with logistic regression has not worked well: its accuracy rate is 56% and its F1 score is 48%.
5.1. Why does logistic regression fail?
The reason the logistic regression model produces poor results has been investigated, and two articles with the same explanation have been found: correlated inputs. One article states that "the model can overfit if you have multiple highly-correlated inputs."24 The data used in this study consists of ratios of 4 different lead isotope values to each other. As an example, the correlation between the 208Pb/206Pb and 206Pb/207Pb columns has been checked, and it is -0.9391. This correlation between the independent variables is also why the naïve Bayes classification algorithm was not used in this study in the first place, and understanding why logistic regression fails is one of the lessons learned during this research.
Although the accuracy rate and F1 score results are similar, there are still some small but important differences worth mentioning. First, the four best-scoring models have better F1 scores than accuracy rates, whereas the F1 score of logistic regression is lower than its accuracy rate. It can be noted that the F1 score was a successful performance measure for this study thanks to its known advantage of seeking a balance between precision and recall on datasets with uneven classes.
The table below on the left, used earlier in the results section, shows the error rankings for country pairs created from the confusion matrices. As mentioned earlier, Wales, Bulgaria and England are the countries with the smallest error rankings, which means they have the highest error rates. These countries have another common feature: they have fewer samples than the other countries. There are several approaches in the literature for dealing with such an unbalanced dataset.25 Two of them, downsampling and upsampling, have been tried during the study. However, the results obtained with downsampling and upsampling were worse for all the models than the results obtained by applying the methods as described in this article: the F1 scores for the ensemble methods applied to the downsampled and upsampled data were around 60-70%, at least 15% lower than the actual results. That is why an appropriate performance metric, the F1 score, was chosen as the way of handling the unbalanced classes. Even though the countries with fewer samples have more misclassified samples, the overall success rates were around 80% for the models applied to the original dataset, which is acceptable for this study.
A part of error ranking table on the left and sample count information for countries on the right
Another remarkable table from the results section shows each country with its 3 most erroneous pairs. This table makes it easier to interpret the misclassified samples with respect to the distances between countries. As seen from the table, for 6 of the 8 countries the most erroneous pairs are the countries closest to them: for example, Bulgaria's most erroneous pair is Greece, and Cyprus's first pair is Turkey. As mentioned earlier, Trincherini and Barbero give some meaningful reasons to be cautious when deciding the lead provenance of an archaeological artefact. The misclassified samples of these 6 countries can be related to two of those reasons: the way of marking ingots could have become uniform across different regions, and the operators of mines moved from one region to another. These explanations make more sense in light of these results.
There are only two exceptions to the "closest" pattern among the 8 countries: England and Wales, whose most erroneous pair is Italy. For these countries the explanation seems different, namely the loading of boats in one region and their movement to another. To interpret the misclassified samples further, another good approach is to examine the locations of the regions with the highest error rates.
The most erroneous regions, those with the highest error rates, can be seen in figure 14. The first thing that draws attention is the error rates themselves. The best methods, which are bagging, adaboost and random forests, have an F1 score of 86%, and the other methods score worse. By simple logic, many of the regions at the top of the list would be expected to have misclassification rates higher than 14%; however, only 2 regions do. This means that the models tend to misclassify samples from the regions with fewer samples.
The second thing that draws attention is the locations of the regions with the highest error rates. The locations of the top 8 regions in the list have been examined, and their common feature is being by the sea. Their locations are shown on the map in Appendix 2. This result can also be related to one of the explanations of Trincherini and Barbero, namely the loading of boats and their movement to another region with ingots. It can thus be stated that samples from regions by the sea are more likely to be misclassified by the models.
The last interpretation concerns the clusters. The countries in each cluster can be seen in the clusters section. The first observation is that whenever England or Wales is in a cluster, at least one of Italy or Spain is in it as well. Secondly, the clusters that do not include Wales or England contain countries that are geographically close to each other; clusters 2 and 3 are examples of this second type, with cluster 2 including Cyprus, Greece and Turkey and cluster 3 including Italy and Spain. Because of the closeness of the countries in the second type of cluster, they can also be related to the same reasons: the way of marking ingots could have become uniform across different regions, and the operators of mines moved from one region to another.
6. Conclusion and Possible Future Work
This study has focused on the use of machine learning algorithms on lead isotope ratio data to predict the provenance of ingots. Some of these algorithms belong to the supervised learning approach and classify samples; others are unsupervised learning models used to reveal similar groups. The classification methods used in this study can also be grouped into ensemble methods, non-ensemble methods and neural networks. For clustering the data, the k-means algorithm and the mclust package of R have been used.
The data had 5 lead isotope ratio columns plus region and country columns. Little work was needed to clean the data, since it had only 20 rows with missing values. Also, 8 of the 17 countries were chosen to work on in order to increase the efficiency of the models; this ensured that both the training and test datasets contain samples from each of the countries. The country column has been used as the target label for the classification algorithms, while the 5 lead isotope ratio columns have been used as independent variables. Moreover, the region and country information of the samples has been used in interpreting the results, together with the locations of the regions and the distances between the countries.
The results of this study fall into 5 pillars: performance measure results, error rates and their rankings for country pairs, countries with their most erroneous pairs, misclassification rates for the regions, and clusters. These results have been discussed and interpreted in light of the reasons specified in the article by Trincherini and Barbero, which are remarkable points of caution when deciding the lead provenance of an archaeological artefact, such as the movement of mine operators between regions.
In conclusion, this study was the first implementation of machine learning algorithms in the context of predicting the provenance of archaeological artefacts. Conventional classification algorithms have mainly been used and their performance measures observed. Although neural networks were applied as one of the methods, there is still room to apply them with different parameters, optimization functions and higher epoch numbers to increase their success rates.
References
[1] Trincherini, P., Barbero, P., Quarati, P., Domergue, C. and Long, L. (2001). Where Do the Lead
Ingots of the Saintes-maries-de-la-mer Wreck Come From? Archaeology Compared With Physics.
Archaeometry, 43(3), pp.393-406.
[2] Dodson, B. (2019). Archaeology vs. Physics: Conflicting roles for old lead. [online]
Newatlas.com. Available at: https://newatlas.com/relics-physics-archaeology-roman-lead/30032/
[Accessed 19 Aug. 2019].
[3] Bode, M., Hauptmann, A. and Mezger, K. (2009). Tracing Roman lead sources using lead
isotope analyses in conjunction with archaeological and epigraphic evidence—a case study from
Augustan/Tiberian Germania. Archaeological and Anthropological Sciences, 1(3), pp.177-194.
[4] Gomes, S., Araújo, M., Monge Soares, A., Pimenta, J. and Mendes, H. (2018). Lead provenance
of Late Roman Republican artefacts from Monte dos Castelinhos archaeological site (Portugal):
Insights from elemental and isotopic characterization by Q-ICPMS. Microchemical Journal, 141,
pp.337-345.
[5] En.wikipedia.org. (2019). Principal component analysis. [online] Available at:
https://en.wikipedia.org/wiki/Principal_component_analysis [Accessed 19 Aug. 2019].
[6] Scikit-learn.org. (2019). sklearn.decomposition.PCA — scikit-learn 0.21.3 documentation.
[online] Available at: https://scikit-
learn.org/stable/modules/generated/sklearn.decomposition.PCA.html [Accessed 19 Aug. 2019].
[7] En.wikipedia.org. (2019). Dependent and independent variables. [online] Available at:
https://en.wikipedia.org/wiki/Dependent_and_independent_variables [Accessed 19 Aug. 2019].
[8] Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of machine learning. 3rd
ed. p.7.
[9] Mitchell, T. (2017). Machine learning (1997). New York: McGraw Hill. pp. 55-58.
[10] Breiman, L., (1996) “Bagging predictors”. Machine Learning, 24:123-140.
[11] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical
Learning (2nd ed.). Springer. ISBN 0-387-95284-5.
[12] Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International
Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–
282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.
[13] En.wikipedia.org. (2019). AdaBoost. [online] Available at:
https://en.wikipedia.org/wiki/AdaBoost [Accessed 19 Aug. 2019].
[14] Hastie, T. (2003). Boosting. [online] Web.stanford.edu. Available at:
https://web.stanford.edu/~hastie/TALKS/boost.pdf [Accessed 19 Aug. 2019].
[15] En.wikipedia.org. (2019). Support-vector machine. [online] Available at:
https://en.wikipedia.org/wiki/Support-vector_machine#cite_note-ReferenceA-13 [Accessed 19
Aug. 2019].
[16] En.wikipedia.org. (2019). K-nearest neighbors algorithm. [online] Available at:
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm [Accessed 19 Aug. 2019].
[17] Scikit-learn.org. (2019). sklearn.linear_model.LogisticRegression — scikit-learn 0.21.3
documentation. [online] Available at: https://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html [Accessed 19
Aug. 2019].
[18] Yeh, C. (2019). Binary vs. Multi-Class Logistic Regression | Chris Yeh. [online]
Chrisyeh96.github.io. Available at: https://chrisyeh96.github.io/2018/06/11/logistic-
regression.html [Accessed 19 Aug. 2019].
[19] Garbade, M. (2018). Understanding K-means Clustering in Machine Learning. [online]
Medium. Available at: https://towardsdatascience.com/understanding-k-means-clustering-in-
machine-learning-6a6e67336aa1 [Accessed 19 Aug. 2019].
[20] Scrucca, L. (2019). A quick tour of mclust. [online] Cran.r-project.org. Available at:
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html [Accessed 19 Aug. 2019].
[21] Witten, I., Frank, E., Hall, M. “Data Mining: Practical Machine Learning Tools and
Techniques, 3rd Ed”.
[22] Cawley, Gavin C.; Talbot, Nicola L. C. (2010). "On Over-fitting in Model Selection and
Subsequent Selection Bias in Performance Evaluation" (PDF). 11. Journal of Machine Learning
Research: 2079–2107.
[23] Shung, K. (2019). Accuracy, Precision, Recall or F1?. [online] Medium. Available at:
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9 [Accessed 19 Aug.
2019].
[24] Brownlee, J. (2019). Logistic Regression for Machine Learning. [online] Machine Learning
Mastery. Available at: https://machinelearningmastery.com/logistic-regression-for-machine-
learning/ [Accessed 19 Aug. 2019].
[25] Boyle, T. (2019). Dealing with Imbalanced Data. [online] Medium. Available at:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
[Accessed 19 Aug. 2019].
Appendix
1- The visualization of the cluster on map
Clusters Clusters on Map
Cluster 1
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Cluster 6
Cluster 7
Cluster 9
2- The location of the regions with the highest error rates
Region Names Regions on Map
Gwynedd
Rhodope
Cornwall
Cumbria
Cyclades,
Siphnos
Peloponnese
Sardinia,
Iglesiente
Sardinia