ORIGINAL ARTICLE
Application of K-means and PCA approaches to estimation of gold grade in Khooni district (central Iran)
Neda Mahvash Mohammadi1 • Ardeshir Hezarkhani1 • Abbas Maghsoudi1
Received: 22 December 2016 / Revised: 12 February 2017 / Accepted: 13 April 2017 / Published online: 25 April 2017
© Science Press, Institute of Geochemistry, CAS and Springer-Verlag Berlin Heidelberg 2017
Abstract Grade estimation is an important phase of mining
projects, and one that is considered a challenge due in
part to the structural complexities of mineral ore deposits.
To overcome this challenge, various techniques have been
used in the past. This paper introduces an approach for
estimating Au ore grades within a mining deposit using
k-means and principal component analysis (PCA). The
Khooni district was selected as the case study. This region
is interesting geologically, in part because it is considered
an important gold source. The study area is situated
approximately 60 km northeast of the city of Anarak and
270 km from Esfahan. Through PCA, we sought to
understand the relationship among gold, arsenic, and
antimony. Then, by clustering, the behavior of
these elements was investigated. One of the best-known
and most efficient clustering methods is k-means, which is based on
minimizing the total Euclidean distance of samples from their class
centers. Using the combined results and the characteristics of
the cluster centers, the gold grade was estimated with a
correlation coefficient of 91%. An estimation equation for
gold grade was derived based on four parameters: arsenic
and antimony content, and the length and width of the
sampling points. The results demonstrate that this approach is
faster and more accurate than existing methodologies for
ore grade estimation.
Keywords K-means method • Clustering • Principal component analysis (PCA) • Estimation • Gold • Khooni district
1 Introduction
Evaluation of ore resources is essential for the economic
planning of a mining project (Pham 1997). One of the
important parameters of resource evaluation is grade esti-
mation. Grade estimation plays a significant role in eval-
uating the economic viability of an ore body. There are
various methods to estimate grade, including those based
on distance and geostatistics (Hassani Pak and Sharafeddin
2005), and nonparametric estimation (Davis and Jalkanen
1988). Recently, clustering methods have been used
extensively in the earth sciences to categorize geochemical
data. This approach divides the data into clusters; within
each cluster, the similarity between data is maximized, and
in different clusters, it is minimized (Berkhin 2006). There
are no categories of original data; indeed, variables are not
divided into independent and dependent groups. Rather, the
approach seeks groups of similar data that can be organized
by the behavior of the data to achieve better results (Mal-
yszko and Wierzchon 2007). The clustering method is an
indirect method. It can be used without previous informa-
tion on the internal database structure and can illuminate
hidden patterns and promote the performance of direct
methods (Abolhassani and Salt 2005). K-means is one of
the most important and efficient clustering methods and is
widely used for data classification (Lloyd 1957; MacQueen
1967; Hartigan and Wong 1979). K-means clustering
allocates all data to k classes. Data within one class are
similar, while data in different classes are dissimilar.
Ardeshir Hezarkhani (corresponding author)
Neda Mahvash Mohammadi
Abbas Maghsoudi
1 Department of Mining and Metallurgical Engineering,
Amirkabir University of Technology, Tehran, Iran
Acta Geochim (2018) 37(1):102–112
https://doi.org/10.1007/s11631-017-0161-7
Cluster analysis endeavors to minimize the average
squared distance between data points and their cluster centers.
In the beginning, k cluster centers are selected at random, and
each observed point is assigned to the class with the nearest
gravity center (for example, by applying Euclidean distance).
The gravity center of each class is then recomputed, which
determines the cluster center. There are many algorithms for k-means, but
the algorithms of Hartigan (1975) and MacQueen (1967)
are more common than the others. If Euclidean distance is
used, the mean of each cluster serves as the gravity
center; if Manhattan distance is used, the component-wise
median takes that role. Algorithm results depend on the number of initial
clusters. Therefore, it is possible to choose the appropriate
results by using different initial k (Templ et al. 2008).
Sometimes, principal component analysis (PCA) is used
before cluster analysis in order to reduce the dimension-
ality of the data set (Ng et al. 2001; Yeung and Ruzzo
2001; Zha et al. 2001). Ding and He (2004) presented the
relationship between k-means and PCA. They also showed
that principal components are the continuous solutions to the
discrete cluster membership indicators for k-means clustering
(Ding and He 2004). Consequently, new lower
bounds are achieved for the k-means objective function.
Multivariate statistics allows the discovery of geo-
chemical patterns of elements and the investigation of
several variables simultaneously. This method is more
frequently used in earth sciences due to its greater relia-
bility over univariate and bivariate statistical methods
(Howarth and Sinding-Larson 1983). The method’s
numerous applications include: determining geochemical
anomalies (Loska and Wiechuła 2003), remote sensing
studies (Loughlin 1991; Crosta and Rabelo 1993; Du and
Flower 2008), environmental studies (Loska and Wiechuła
2003), oil and gas studies (Prinzhofer et al. 2000; Pasadakis
et al. 2004), and geophysical studies (Sabeti et al. 2007).
PCA is one of the multivariate statistical methods based on
eigenvalues and eigenvectors. PCA detects directions with
the highest variation and moves the data to a new coordi-
nate system, reducing the dimension of the data set and
summarizing the data characteristics (Jolliffe 1986).
However, information about the source of the sample is
required, a non-trivial task due to the huge number of
sources and the complexity of in-ground networks. The
useful and popular methods of PCA and clustering make
use of data source analysis.
In this paper, PCA was applied for clustering and
reducing the dimension of original data and then k-means
was used to investigate the behavior of identified elements.
This process allowed for the estimation of gold ore grade.
2 Geology of the Khooni district
Anarak is the most important metallogenic province of
Iran, hosting numerous ore deposits and mineralized veins
in its magmatic and metamorphic upper Proterozoic rocks.
Previous studies in this area have identified a variety of
metallic and non-metallic deposits, including deposits of
lead, zinc, gold, silver, chromite, iron, manganese,
molybdenum, antimony, etc. The variety of mines in the
area has attracted a number of researchers. Khooni is one
of the most famous mines in the east of Anarak. It has been
an important source of gold (Mahvash Mohammadi et al.
2016). Khooni is 60 km northeast of Anarak and 270 km
from Esfahan and belongs to the Central Iran geological
zone, one of the most complex structural units of Iran due
to its position between the Iran and Turan Plates. Faults in
the study area include the Darune in the north, the Dehlim-
Baghet in the west, the Bashagard in the south, and the
Nahbandan in the east (Darvish 2011).
The first scientific study on the Khooni District was
performed by Adib (1972). He carried out multiple atomic
absorption analyses to determine the mineralogy. Adib
discovered a large amount of gold in the silica and car-
bonate veins in the study area; the average content of gold
was presented as 20 ppm (Adib 1972). The mineralization
is mainly vein and poly-metal (Nezampoor and Rasa 2005).
Promising mineral potential was signaled by the mineral-
ization and by study of remote sensing maps in the region.
According to Mahvash Mohammadi et al. (2016), gold
mineralization in the area is at a sufficient grade and ton-
nage to warrant exploratory investigation.
Active tectonism and faults have played a major role in
the tectonic structure, position of veins, position of igneous
rocks, alteration, and mineralization in this area (Bagheri
et al. 2007). Eocene magmatism and tectonism along with
multiple phases of metamorphism are collectively respon-
sible for the deposits.
Stratigraphy of the study area is from Precambrian to
Quaternary. Outcrops in the western portion of the area
mainly consist of Cambrian metamorphic units, while in
the east are Eocene volcanics and pyroclastics with a
dominant composition of andesite and trachyandesite cut
by monzonite dikes. Outcrops of Cretaceous limestone in
the northwest corner of the area unconformably overlie
older units. Low altitude and lowland areas are covered by
old alluvial terraces, plains sediments, and young alluvial
deposits. The oldest lithostratigraphic units in the region
are a series of metamorphic rocks including schist, quart-
zite, marble, and amphibolite with serpentine blocks from
Precambrian to lower Cambrian (Fig. 1; Heydarian
Dehkordi and Rassa 2011, 2012).
3 Principal component analysis
PCA is one of the most powerful multivariate techniques
for data analysis and processing (Jolliffe 1986) and is
frequently used in data analysis (Lin 2012) across many
fields, such as data compression, image processing, visu-
alization, exploratory data analysis, pattern recognition and
time series prediction (Tipping and Bishop 1999; Labib
and Vemuri 2005). PCA is applied to reduce dimensions
and obtain a smaller number of variables for input to fur-
ther analysis. In order to achieve this, PCA transforms
initial data into a new set of variables—the principal
components (Jolliffe 2002). The goal of PCA is to sum-
marize the features of the data. For example, PCA was used
to reduce and select the features of the reflectance spectra
of lunar soil samples (Xiaoya et al. 2009).
PCA is based on eigenvalues and eigenvectors (Hotell-
ing 1933). It works by identifying directions with the lar-
gest variances, then calculating the principal components,
which are linear combinations of the correlated initial
variables. The initial variables’ N-dimensional space is
transformed into new Cartesian coordinates of uncorrelated
variables in a P-dimensional space such that P is less than
N ($P < N$).
Fig. 1 Geological map of Khooni district (Pourjabar 2005)
As mentioned, PCA is a method for finding linear
combinations of the correlated initial variables to make a
new coordinate system. The new directions are along the
largest variance. The component with the largest variance
among all principal components is considered the first
(PC1). The second (PC2) has the largest remaining variance
(inertia), which is lower than that of PC1, and is orthogonal to PC1.
The other components are computed in the same way. Assume PC1
is a linear combination of the initial variables x1 to xn
(Hassani Pak and Sharafeddin 2005):
$$y_1 = \mathrm{PC}_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n \qquad (1)$$
And in matrix form:
$$y_1 = [a_1]^T [x] \qquad (2)$$
If the coefficients (weights) a1j of the linear combination
were allowed to grow without limit, the variance would increase
arbitrarily. Therefore, the coefficients are constrained as follows:
$$a_{11}^2 + a_{12}^2 + \dots + a_{1n}^2 = 1 \qquad (3)$$
And the variance can be calculated as follows:
$$S_{y_1}^2 = [a_1]^T [S] [a_1] \qquad (4)$$
where [S] is the original variables’ covariance matrix and
a1j are the initial variables’ coefficients. The principal
components can be collected in a vector [y], the initial
variables in another vector [x], and their weights in the
form of a matrix [A], with the relationship expressed as
follows:
$$[y] = [A][x] \qquad (5)$$
The number of computable principal components
depends on the number of correlated initial variables and the
maximum justifiable variance. The first few PCs usually account
for most of the variation. Therefore, the number of main components can be
significantly reduced compared to the initial variables.
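The relationships in Eqs. 1 to 4 can be illustrated with a minimal Python sketch. It is not the MATLAB/SPSS workflow used in the study: the toy data are hypothetical, the function names `covariance_matrix` and `first_component` are ours, and power iteration is just one simple way to obtain the leading eigenvector of the covariance matrix.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix [S] of a list of equal-length observation rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(S, iterations=500):
    """Power iteration: returns (variance of PC1, unit weight vector a1)."""
    p = len(S)
    a = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        v = [sum(S[i][j] * a[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        a = [x / norm for x in v]  # keeps the weights on the unit sphere (Eq. 3)
    # Variance of the component, a1^T [S] a1 (Eq. 4).
    variance = sum(a[i] * S[i][j] * a[j] for i in range(p) for j in range(p))
    return variance, a

# Hypothetical toy data: five samples of three variables, two of them correlated.
data = [[2.0, 1.9, 0.5], [3.1, 3.0, 0.4], [4.2, 3.8, 0.6],
        [5.0, 5.1, 0.5], [6.1, 5.9, 0.7]]
S = covariance_matrix(data)
var1, a1 = first_component(S)
# Scores of each sample on PC1: y1 = a1^T x (Eqs. 1 and 2).
scores = [sum(w * x for w, x in zip(a1, row)) for row in data]
```

Further components would be obtained the same way after removing the variance already explained by PC1; a library eigen-solver would normally replace the hand-written iteration.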
4 k-means algorithm
The k-means approach (Ball and Hall 1967; MacQueen
1967) is one of several clustering methods used in data
mining. K-means clustering is considered exclusive and
flat. Samples are partitioned into a specified number of clusters
(k) so as to minimize the sum of the Euclidean distances of the
samples from the centers of their clusters (Chen
and Chien 2010). The k-means algorithm starts with a
certain number of clusters and endeavors to achieve cluster
centers that are the average points of their respective
clusters. The initial cluster centers are randomly selected.
Each data point is assigned to one of the clusters based
on greater similarity, and new clusters are obtained. This
process is repeated, and in each iteration new centers
are calculated by averaging the data and data points are
reassigned to the new clusters (Egozcue et al. 2003). The
processing algorithm is as follows:
1. Start with a set of N data points in multidimensional
space.
2. Choose an integer k and select k points as initial cluster centers.
3. Calculate the mean of each cluster as its centroid.
4. Calculate the squared Euclidean distance of each sample from
the cluster centers.
5. Assign each sample to the cluster whose centroid gives
the least within-cluster sum of squares. The goal is
to minimize the squared distance from each
sample to the nearest center, summed over all samples:
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (6)$$
where cj is the centroid of cluster j, xi(j) is the ith data point
in cluster j, and || || indicates distance.
6. Calculate new means based on the new clusters, and repeat
from step 4 until the cluster assignments stop changing.
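The steps listed above can be sketched in a few lines of Python. This is an illustrative Lloyd-style loop, not the MATLAB routine used in the study; the function name `kmeans`, the six sample points, and the two starting centers are hypothetical. The starting centers are passed in explicitly so that the example is reproducible, whereas the text describes a random start.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Lloyd-style k-means. Returns (centroids, labels, J) with J as in Eq. 6."""
    centroids = [list(c) for c in initial_centers]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Steps 4-5: assign each sample to the nearest centroid (Euclidean distance).
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are stable too
        labels = new_labels
        # Steps 3 and 6: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Objective of Eq. 6: sum of squared distances to the assigned centroid.
    J = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, J

# Hypothetical two-dimensional samples forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels, J = kmeans(points, initial_centers=[(1.0, 1.0), (5.0, 5.0)])
```

Because the result depends on the starting centers, a practical run would typically repeat the procedure from several random starts and keep the solution with the smallest J.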
In this study, we used an efficient criterion, known as
the silhouette criterion, to specify the appropriate number of
clusters (k) in the data set. Rousseeuw (1987) presented
this method; the silhouette method can evaluate how well all
samples are grouped, even the ones located between clusters.
4.1 Definition of silhouette
Initially, all data were classified and put into k clusters. The
silhouette, S(i), is a measure of how well sample i is
assigned to its cluster and can be computed
as follows (Rousseeuw 1987):
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (7)$$
where a(i) is the average dissimilarity of the sample to all
other samples in the same cluster and b(i) is the minimum
average dissimilarity of the sample to all other clusters to
which the sample is not a member. Since b(i) depends on
comparison with other clusters, the number of clusters, k,
must be more than one. The equation can be written as:
$$S(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases} \qquad (8)$$
where S(i) is between -1 and +1. If S(i) is close to +1, the
sample is in an appropriate cluster; a negative S(i) indicates
that the sample is in the wrong cluster; and S(i) = 0
indicates that the sample fits equally well in multiple
clusters.
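As an illustration of Eq. 7, the following Python sketch computes S(i) for every sample of a small labeled data set. The function name `silhouette_values` and the six points are hypothetical, Euclidean distance is assumed as the dissimilarity, and a sample that is alone in its cluster is given S(i) = 0, which is the usual convention rather than something stated in the text.

```python
import math

def silhouette_values(points, labels):
    """Return S(i) of Eq. 7 for every sample (requires at least two clusters)."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # Mean distance from sample i to the members of each cluster, excluding i.
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            mean_dist[c] = sum(d) / len(d) if d else None
        a = mean_dist[labels[i]]          # a(i): own cluster
        if a is None:
            values.append(0.0)            # single-member cluster (assumed convention)
            continue
        # b(i): smallest mean distance to any other cluster.
        b = min(m for c, m in mean_dist.items() if c != labels[i])
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Hypothetical samples: a sensible labeling versus a deliberately mixed one.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
good = silhouette_values(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

The average of S(i) over all samples is the quantity that is compared across candidate values of k.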
5 Datasets
A total of 256 stream sediment samples were analyzed for 32
elements by inductively coupled plasma mass spectrometry and
the fire assay method (Fig. 1). The distributions
are not normal. The data, which were normalized by
taking logarithms, and their statistical characteristics are
presented in Table 1. We used MATLAB and SPSS
software to analyze the elemental stream sediment data
with k-means and PCA and, ultimately, to estimate the grade
of gold.
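The effect of the logarithmic transformation on a strongly right-skewed variable can be sketched as follows. The ten values are hypothetical (they only span the minimum and maximum reported for Au in Table 1), the base-10 logarithm is an assumption since the text does not state the base, and `skewness` is our own helper.

```python
import math

def skewness(values):
    """Moment-based sample skewness; values near 0 indicate a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed concentrations, not the measured samples.
raw = [0.44, 1.2, 2.5, 4.0, 8.0, 15.0, 40.0, 120.0, 900.0, 5740.0]
logged = [math.log10(v) for v in raw]   # assumed base-10 transform

skew_raw = skewness(raw)
skew_log = skewness(logged)
```

A skewness much closer to zero after the transform is what makes the logged data more suitable for correlation analysis and PCA.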
6 Results and discussion
6.1 Correlation coefficient analysis results
A correlation coefficient illustrates inter-element relation-
ships and correlations. It can reveal some interesting
information about the sources of metals (Rodríguez et al.
2008). Spearman’s correlation coefficients rank as follows:
significant correlation (0.5–1.0 and -1.0 to -0.5) which
are bolded in Table 2, medium correlation (0.3–0.5 and
-0.5 to -0.3), low correlation (0.1–0.3 and -0.3 to -0.1),
and uncorrelated (0.0–0.09 and -0.09 to 0.0; Fei et al.
2014). The results of the Pearson correlation coefficients
between fifteen evaluated elements are in Table 2.
Correlation coefficients were calculated for the stream
sediment samples. These coefficients indicated a significant
positive correlation among some elements, such as Au–As
(0.810), Au–Sb (0.776), and Sb–As (0.795) which are
bolded in Table 2. The relationships between these ele-
ments indicate they might feasibly have the same source;
the geochemistry is similar across the study area (Mahvash
Mohammadi and Hezarkhani 2015). In addition, there are
relatively strong correlations between Cu, Mo, Pb, and Sb.
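The correlation classes described above can be expressed as a short Python sketch. The helper names `pearson` and `strength` are ours; the thresholds follow the ranges quoted in the text, with the assumption that magnitudes below 0.1 count as uncorrelated (the quoted ranges leave a small gap between 0.09 and 0.1). The four coefficients are taken from the Au row and column of Table 2.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Classify a coefficient by magnitude, following the ranges in the text."""
    m = abs(r)
    if m >= 0.5:
        return "significant"
    if m >= 0.3:
        return "medium"
    if m >= 0.1:
        return "low"
    return "uncorrelated"  # assumption: everything below 0.1

# Coefficients with Au as listed in Table 2.
au_pairs = {"As": 0.810, "Ag": 0.343, "Sn": 0.156, "Fe": -0.043}
classes = {element: strength(r) for element, r in au_pairs.items()}
```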
6.2 Principal component analysis
A scree plot (Fig. 2) shows that the eigenvalues of the first
four components are greater than one, and account for
about 79.28% of the total variance of data (Table 3). With the criteria
of eigenvalue greater than 1 and percentage of variance
explained, four principal components were
selected, which are bolded in Table 3. The first component
(F1) of the fifteen components had the highest percentage of
variability in the area. PCA was applied to the concentrations
of Au, As, Ag, Sb, Ba, Be, Co, Cr, Cu, Fe, Mo, Ni, Pb, Sn,
and Zn to reduce dimensions and for classification. PCA
results are presented for all four components in Table 4.
Table 1 Statistical characteristics of elements in the Khooni district (raw values)
Variable Observations Minimum Maximum Mean SD
Sb 256 0.700 16.900 1.360 1.141
As 256 5.200 152.000 14.592 12.426
Au 256 0.440 5740.000 24.684 158.647
Ag 256 0.010 1.490 0.067 0.142
Ba 256 349.000 542.000 430.004 36.466
Be 256 0.700 1.500 0.926 0.113
Co 256 7.900 35.600 13.480 4.279
Cr 256 56.000 1960.000 181.977 224.394
Cu 256 16.900 461.000 28.798 30.379
Fe 256 21,100.000 154,000.000 38,455.078 18,649.745
Mo 256 0.800 120.000 2.363 7.890
Ni 256 51.000 140.000 69.020 14.481
Pb 256 13.700 16,600.000 144.271 1079.558
Sn 256 0.700 6.100 1.273 0.434
Zn 256 41.300 3090.000 97.977 212.027
Table 2 Pearson correlation coefficients of elements in the Khooni district
Variables Sb As Au Ag Ba Be Co Cr Cu Fe Mo Ni Pb Sn Zn
Sb 1
As 0.795 1
Au 0.776 0.810 1
Ag 0.176 0.092 0.343 1
Ba -0.163 -0.032 -0.219 -0.411 1
Be 0.022 0.237 0.046 0.023 0.235 1
Co 0.169 0.382 -0.104 -0.242 0.241 0.183 1
Cr 0.127 0.191 -0.074 -0.104 0.113 -0.084 0.758 1
Cu 0.863 0.708 0.347 0.228 -0.157 0.146 0.342 0.251 1
Fe 0.148 0.249 -0.043 -0.145 0.188 0.087 0.836 0.863 0.301 1
Mo 0.686 0.715 0.327 0.239 -0.085 0.006 0.253 0.256 0.848 0.350 1
Ni 0.052 0.511 -0.047 -0.063 0.127 0.165 0.775 0.511 0.253 0.637 0.207 1
Pb 0.588 0.682 0.247 0.135 -0.124 -0.146 -0.001 0.014 0.934 0.063 0.969 -0.049 1
Sn 0.183 0.229 0.156 -0.002 0.007 0.157 0.434 0.426 0.275 0.476 0.193 0.329 0.079 1
Zn 0.793 0.738 0.349 0.036 -0.021 -0.069 0.392 0.320 0.964 0.399 0.908 0.251 0.960 0.275 1
Bold values indicate a strong relationship between these elements
As shown in Tables 3 and 4, the first component (PC1)
accounted for 36.741% of the total variance with strong
loadings on Au (0.973), Sb (0.933), and As (0.870) and
moderate loadings on Cu (0.652), Mo (0.625), Pb (0.646),
and Zn (0.658). Loadings greater than 0.5 are bolded in
Table 4. The behaviors of
elements in this group were highly dependent, with a strong
positive correlation between them. PC2 accounted for
23.187% of the total variance and loaded heavily on Co
(0.922), Cr (0.836), Fe (0.864), Ni (0.747), and Sn (0.550).
PC3 had a high loading only on Ag (0.870) and accounted
for 11.019% of the total variance. PC4 accounted for about
8.331% of the total variance and had loadings on Ba
(0.701) and Be (0.781).
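The selection rule used here (retain components whose eigenvalue exceeds 1) and the variance shares can be recomputed directly from the eigenvalues listed in Table 3. The sketch below is only a check on the arithmetic; the variable names are ours.

```python
# Eigenvalues F1..F15 as listed in Table 3.
eigenvalues = [5.511, 3.478, 1.653, 1.250, 0.840, 0.761, 0.502, 0.330,
               0.247, 0.153, 0.115, 0.068, 0.051, 0.026, 0.014]

total = sum(eigenvalues)  # close to 15, the number of standardized variables
percent = [100.0 * ev / total for ev in eigenvalues]

# Eigenvalue-greater-than-one rule: indices (1-based) of the retained components.
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

# Share of the total variance carried by the retained components.
cumulative = sum(percent[i - 1] for i in retained)
```

Each share is simply the eigenvalue divided by the sum of all eigenvalues, so the percentages in Table 3 follow from its eigenvalue row.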
Figure 3 shows loading plots of elements in the space
defined by two components. As shown in Fig. 3a, As and
Sb are close to Au; Cu, Mo, and Zn are close to Pb; Ni, Cr,
Co, Fe, and Sn are in a group; and Ba and Be are near each
other. These groups suggest Au, As, and Sb have strong
relationships; other groups of elements suggest the distri-
bution of these elements is related. Likewise, in Fig. 3b,
Cr, Co, Fe, Sn, and Ni make a group—an association of
correlated elements. A good correlation exists between Ba
and Be (Fig. 3c). Ag is separate and does not fall into any
group. The PCA agreed with the cluster analysis, with each
producing strong clusters and the same element groupings.
Au, As, and Sb, which share paragenesis, are in a cluster
and have a strong relationship.
6.3 K-means results
In the present study, we used k-means to cluster stream
sediment samples of the Khooni area and calculate the
optimum k for three elements—gold, arsenic, and anti-
mony—according to the coordinates of the sampling points.
The silhouette criterion was used to determine the
number of clusters (k); k was varied from 3 to 10. Then, the
results were analyzed to select the optimum k. Figure 4
shows the average silhouette value as a function of the number of
clusters. The number of clusters with the maximum average
silhouette value was selected as the optimum. As explained in
Sect. 4.1, if the value of the average silhouette is close
to 1, the samples have been clustered correctly. Figure 4a
exhibits the average silhouette values for arsenic and
antimony; clustering with k = 3 is selected as the optimum
k, because of the high associated average silhouette value,
0.8584. Similarly, k = 3 with an average silhouette of 0.9647
and k = 4 with 0.8064 were chosen as the optimum
numbers of clusters for coupling gold and antimony
(Fig. 4b), and for coupling gold and arsenic (Fig. 4c),
respectively.
Fig. 2 Scree plot of elements in the Khooni district
Table 3 Eigenvalue and percentage of variance of elements in the Khooni district
F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15
Eigenvalue 5.511 3.478 1.653 1.250 0.840 0.761 0.502 0.330 0.247 0.153 0.115 0.068 0.051 0.026 0.014
Variability (%) 36.741 23.187 11.019 8.331 5.602 5.073 3.350 2.202 1.647 1.021 0.767 0.454 0.338 0.174 0.093
Bold values indicate a strong relationship between these elements
F Factor loadings
Table 4 Factor loadings of elements in Khooni district
F1 F2 F3 F4
Sb 0.933 -0.011 0.059 0.010
As 0.870 -0.027 0.164 0.073
Au 0.973 0.026 0.286 0.013
Ag 0.203 -0.111 0.870 0.199
Ba -0.068 0.238 -0.149 0.701
Be -0.025 0.189 0.163 0.781
Co 0.296 0.922 -0.002 0.001
Cr 0.257 0.836 -0.070 -0.248
Cu 0.652 -0.161 -0.032 0.007
Fe 0.339 0.864 -0.081 -0.104
Mo 0.625 -0.222 -0.064 0.042
Ni 0.231 0.747 0.104 0.085
Pb 0.646 -0.254 -0.083 0.009
Sn 0.139 0.550 0.119 -0.164
Zn 0.658 -0.179 -0.075 -0.028
Bold values indicate a strong relationship between these elements
F Factor loadings
All data were classified by k-means and then the centroids
of each cluster were calculated. The diagram of
centroids is shown for gold and arsenic with k = 4
(Fig. 5a), for gold and antimony with k = 3 (Fig. 5b), and
for arsenic and antimony with k = 3 (Fig. 5c).
With increasing grade of gold in Fig. 5a, the grade of
arsenic increases and subsequently decreases. The best
regression is a quadratic curve, $y = -0.1158x^2 + 8.5502x - 4.756$,
with a coefficient of determination $R^2 = 0.9625$.
The behavior of gold and antimony is similar to the
behavior of gold and arsenic (Fig. 5b); with increasing gold
grade, the grade of antimony initially increases and later
decreases. The best regression is a quadratic curve with
negative concavity, $y = -0.0012x^2 + 0.0928x + 1.134$, and
a coefficient of determination $R^2 = 1$. The behavior of arsenic
and antimony is predictable according to previous results.
As a result, we expect these two elements to have a direct
relationship. The grade of arsenic increases with increased
grade of antimony. The best regression plotted on centroids
was $y = 0.0274x + 0.9713$, with a coefficient of determination
$R^2 = 0.9992$ (Fig. 5c).
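When a regression is fitted to cluster centroids rather than to individual samples, the number of points is very small, which affects how the fit statistics should be read. The Python sketch below fits a quadratic by least squares and reports R²; the centroid coordinates are hypothetical (they are not the values plotted in Fig. 5), and `fit_quadratic` and `r_squared` are our own helpers. With exactly three centroids a quadratic passes through every point, so R² = 1 is obtained by construction; with four centroids the fit is a genuine approximation.

```python
def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    rows = [[x * x, x, 1.0] for x in xs]
    # Augmented normal equations [A^T A | A^T y] for the columns x^2, x, 1.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return tuple(M[i][3] / M[i][i] for i in range(3))

def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted quadratic."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical centroids: three points (k = 3) and four points (k = 4).
x3, y3 = [5.0, 20.0, 60.0], [1.5, 2.6, 2.4]
x4, y4 = [1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 5.0]
r2_three = r_squared(x3, y3, fit_quadratic(x3, y3))
r2_four = r_squared(x4, y4, fit_quadratic(x4, y4))
```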
Fig. 3 Loading plot of elements in the Khooni district
Fig. 4 Changes in the value of S(i) based on the number of clusters: a arsenic and antimony, b gold and antimony, c gold and arsenic
Fig. 5 The best regression curve: a k = 4 classifications relating to Au and As; b k = 3 classifications relating to Au and Sb; c k = 3 classifications relating to Sb and As
7 Estimating the grade of gold
Estimating the gold grade depends on the behavior of gold,
arsenic, and antimony in relation to the sample coordinates. In
the first step in determining this relationship, the optimum
number of clusters was selected by the silhouette criterion. The
highest value of the average silhouette was 0.6568,
obtained for k = 5 (Fig. 6). The coordinates of the
samples along with the grades of gold, arsenic, and antimony
were used as input data. The ranges of the coordinate and
grade values are thus different. To avoid creating errors in
calculations and to obtain the correct estimation, all input
values were rescaled to a standard range, the interval [0,
1], by Eq. 9.
$$X_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (9)$$
The specific cluster centers are given in Table 5 for k = 5.
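Equation 9 is a min-max rescaling, which can be written as a one-line Python function. The helper name `min_max_normalize` is ours, mapping a constant column to zero is an assumption made only to avoid division by zero, and the three example values are the minimum, mean, and maximum reported for Sb in Table 1.

```python
def min_max_normalize(values):
    """Rescale values to the interval [0, 1] as in Eq. 9."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant column maps to 0
    return [(v - lo) / (hi - lo) for v in values]

# Minimum, mean, and maximum of Sb from Table 1.
sb_example = [0.700, 1.360, 16.900]
sb_scaled = min_max_normalize(sb_example)
```

After rescaling, coordinates in meters and grades in ppm or ppb contribute on a comparable scale to the Euclidean distances used by k-means.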
Using multivariate regression in SPSS software to
determine the relationship between gold, arsenic, and
antimony according to the sampling coordinates in the study
area, the grade of gold was estimated. Values of gold (the
dependent variable) and the amounts of arsenic and antimony,
and the length and width of samples (the independent
variables) were introduced into the application.
Specifications and multivariate regression coefficients were
calculated and are reported in Table 6.
According to the coefficients in Table 6, the formula of
the multiple regression was determined as follows:
$$\mathrm{Au} = 1.995\,\mathrm{As} - 14.892\,\mathrm{Sb} + 0.221X - 0.375Y + 0.561 \qquad (10)$$
The R2 indicates how much of the variation in Au the regression
model explains. Here, R2 = 0.7372 (Table 6); thus about 74% of
the variation in the gold value (y) is explained by the values of
the independent variables (arsenic, antimony, and length and width of the sample).
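Equation 10 can be evaluated with a few lines of Python. The sketch below applies it to the cluster centroids of Table 5 and rests on two assumptions that the text does not state explicitly: that X and Y denote the length and width columns, in that order, and that the equation operates on values already rescaled to [0, 1] by Eq. 9. The names `estimate_gold` and `centroids` are ours.

```python
# Coefficients of Eq. 10 (Table 6).
COEFFS = {"As": 1.995, "Sb": -14.892, "X": 0.221, "Y": -0.375}
INTERCEPT = 0.561

def estimate_gold(arsenic, antimony, x, y):
    """Evaluate Eq. 10; inputs are assumed to be normalized to [0, 1]."""
    return (COEFFS["As"] * arsenic + COEFFS["Sb"] * antimony
            + COEFFS["X"] * x + COEFFS["Y"] * y + INTERCEPT)

# Table 5 centroids as (As, Sb, length, width, reported gold).
centroids = [
    (0.028, 0.033, 0.889, 0.688, 0.064),
    (0.157, 0.061, 0.708, 0.321, 0.017),
    (0.038, 0.036, 0.552, 0.562, 0.010),
    (0.060, 0.046, 0.339, 0.199, 0.006),
    (0.049, 0.028, 0.167, 0.727, 0.006),
]
estimates = [estimate_gold(*c[:4]) for c in centroids]
residuals = [est - c[4] for est, c in zip(estimates, centroids)]
```

Comparing `estimates` with the reported gold column gives a quick view of how closely the equation reproduces each centroid under these assumptions.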
To validate the estimation of gold, Eq. 10 was applied to a
subset of the original data to assess its accuracy (Fig. 7).
We randomly selected
30% of the samples and estimated the grade of gold
using the grades of arsenic and antimony and the length and
width of the samples. The computed values were then
compared with original values, resulting in a correlation
coefficient of 91% (Fig. 7). The estimated values of gold
almost match the original values of gold, demonstrating the
accuracy of the applied method.
Fig. 6 Changes in the value of S(i) based on the number of clusters (for gold, arsenic, and antimony)
Table 5 Specification of cluster centroids (clustering by k = 5)
Centroid number  Gold grade (ppb)  Arsenic grade (ppm)  Antimony grade (ppm)  Length (m)  Width (m)
First            0.064             0.028                0.033                 0.889       0.688
Second           0.017             0.157                0.061                 0.708       0.321
Third            0.010             0.038                0.036                 0.552       0.562
Fourth           0.006             0.060                0.046                 0.339       0.199
Fifth            0.006             0.049                0.028                 0.167       0.727
Table 6 Specifications and multivariate regression coefficients
a4 a3 a2 a1 b R2 R
-0.375 0.221 -14.892 1.995 0.561 0.7372 0.856
Fig. 7 Scatter plot of the measured gold values against estimated values for validation
8 Conclusion
Due to the evidence of gold mineralization in the Khooni
district, estimating Au concentration is important. Elements
were divided into five groups: Au, As, and
Sb; Cu, Mo, Zn, and Pb; Ni, Cr, Co, Fe, and Sn; Be and Ba; and
Ag in a group of its own. Based on the paragenesis of Au with
As and Sb and the results of our analysis, As and Sb were
considered for estimating Au concentration.
Through PCA and k-means, an equation was obtained for
estimating the gold grade based on arsenic, antimony, and
length and width of the samples:
$\mathrm{Au} = 1.995\,\mathrm{As} - 14.892\,\mathrm{Sb} + 0.221X - 0.375Y + 0.561$, with a correlation
coefficient of 91%.
Acknowledgements The authors thank the editors and anonymous
reviewers for their constructive comments, which have significantly
improved the manuscript.
References
Abolhassani B, Salt JE (2005) A simplex K-means algorithm for
radio-port placement in cellular networks. In: Canadian Conference
on Electrical and Computer Engineering
Adib D (1972) Khooni mine mineralogy, Ph.D. Thesis, University of
Shiraz, Shiraz (in Persian)
Bagheri H, Moore F, Alderton DHM (2007) Cu–Ni–Co–As–(U) mineralization
in the Anarak area of central Iran. J Asian Earth Sci
29:651–665
Ball GH, Hall DJ (1967) A clustering technique for summarizing
multivariate data. Behav Sci 12(2):153–155
Berkhin P (2006) A survey of clustering data mining techniques. In:
Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimen-
sional data. Springer, Berlin, pp 25–71
Chen TW, Chien SY (2010) Bandwidth adaptive hardware architec-
ture of K-means clustering for video analysis. IEEE Trans VLSI
Syst 18(6):957–966
Crosta AP, Rabelo A (1993) Assessing landsat/TM for hydrothermal
mapping in central western, Brazil. In: processing of the 9th
thematic conference of geologic remote sensing
Darvish M (2011) Mineralogy studies and determine Khooni skarn
sources-North East of Anarak, Esfahan province. M.Sc. Thesis,
University of Esfahan, Esfahan, p 153 (in Persian)
Davis BM, Jalkanen GJ (1988) Nonparametric estimation of multi-
variate joint and conditional spatial distributions. Math Geol
20(4):367–381
Ding C, He X (2004) K-means clustering via principal component
analysis. In Appearing in proceedings of the 21st international
conference on machine learning, Banff
Du Q, Flower EJ (2008) Low-complexity principal component
analysis for hyperspectral image compression. Int J High
Perform Comput Appl 22(4):438–448
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G (2003) Isometric
logratio transformations for compositional data analysis. Math
Geol 35:279–300
Fei Q, Hongbing J, Qian L, Xinyue G, Lei T, Jinguo F (2014)
Evaluation of trace elements and identification of pollution
sources in particle size fractions of soil from iron ore areas along
the Chao River. J Geochem Explor 138:33–49
Hartigan JA (1975) Clustering algorithms (probability & mathemat-
ical statistics). Wiley, London
Hartigan JA, Wong MA (1979) A K-means clustering algorithm. J R
Stat Soc 28(1):100–108
Heydarian Dehkordi N, Rassa A (2011) Study of alteration and
investigating genetic between gold and other elements in khooni
district. J Appl Geol Iran 2:95–106 (in Persian)
Heydarian Dehkordi N, Rassa A (2012) Characteristics and genesis of
gold mineralization in Eocene volcanic units of khooni Cheshmeh,
Anarak, the nature of mineralizing fluids and comparison
with other types of gold deposits. J Geol Iran 17:73–85 (in Persian)
Hotelling H (1933) Analysis of a complex of statistical variables into
principal components. J Educ Psychol 24:417–441
Howarth RJ, Sinding-Larson R (1983) Statistic and data analysis in
geochemical prospecting. Handb Explor Geochem Amsterdam
2:207–289
Jolliffe IT (1986) Principal component analysis. Springer, New York
Jolliffe I (2002) Principal component analysis for special types of
data. In: Principal component analysis, 2nd edn. Springer, New
York, pp 338–372
Labib K, Vemuri V (2005) Application of exploratory multivariate
analysis for network security. In: Vemuri V (ed) Enhancing
computer security with smart technology. CRC Press, Boca
Raton, pp 229–261
Lin JW (2012) Study of ionospheric anomalies due to impact of
typhoon using principal component analysis and image process-
ing. J Earth Syst Sci 121(4):1001–1010
Lloyd S (1957) Least squares quantization in pcm. Bell Telephone
Laboratories Paper, Murray Hill
Loska K, Wiechuła D (2003) Application of principal component
analysis for the estimation of source of heavy metal contami-
nation in surface sediments from the Rybnik Reservoir. J Che-
mosphere 51:723–733
Loughlin WPG (1991) Principal component analysis for alteration
mapping. Photogram Eng Remote Sens 57(9):1163–1169
MacQueen J (1967) Some methods for classification and analysis of
multivariate observations. In: Proceedings of the 5th Berkeley
symposium on mathematical statistics and probability, Califor-
nia, vol 1, pp 281–297
Mahvash Mohammadi N, Hezarkhani A (2015) Estimation of grade
gold in Khooni deposit using the behavior of gold, Arsenic and
Antimoney elements by clustering K-means method. J Anal
Numer Methods Min Eng 5(10):77–92 (in Persian)
Mahvash Mohammadi N, Hezarkhani A, Shokouh Saljooghi B (2016)
Separation of a geochemical anomaly from background by
fractal and U-statistic methods, a case study: Khooni district,
Central Iran. Chem Erde/Geochem 76:491–499
Malyszko D, Wierzchon ST (2007) Standard and genetic K-means
clustering techniques in image segmentation. In: 6th Interna-
tional conference on computer information systems and indus-
trial management applications
Nezampoor H, Rasa A (2005) Study of gold mineralization in oxide
veins Khooni–Anarak region. In The twenty-fourth national
geosciences symposium (in Persian)
Ng A, Jordan M, Weiss Y (2001) On spectral clustering: Analysis and
an algorithm. In: Proceedings of the neural information
processing systems
Pak HAA, Sharafeddin M (2005) Exploration data analysis. Tehran
University Press, Tehran (in Persian)
Pasadakis N, Obermajer M, Osadetz KG (2004) Definition and
characterization of petroleum compositional families in Willis-
ton Basin, North America using principal component analysis.
J Org Geochem 35:453–468
Pham TD (1997) Grade estimation using fuzzy-set algorithms. Math
Geol 29(2):291–305
Pourjabar A (2005) Geochemical investigations on the polymetalic
vein in Khooni (Esfahan Province), M. Sc Thesis, Amirkabir
University of Technology, Tehran, p 217 (in Persian)
Prinzhofer A, Mello MR, Da Sila Freitas LC, Takaki T (2000) A new
geochemical characterization of natural gas and its use in oil and
gas evaluation. In: Mello MR, Katz BJ (eds) Petroleum systems
and south Atlantic Margins. American Association of Petroleum
Geologists Bulletin, Memoir 70:107–119
Rodríguez JA, Nanos N, Grau JM, Gil L, Lopez-Arias M (2008)
Multiscale analysis of heavy metal contents in Spanish agricul-
tural topsoils. Chemosphere 70:1085–1096
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation
and validation of cluster analysis. Comput Appl Math 20:53–65
Sabeti H, Javaherian A, Araabi ND (2007) Principal component
analysis applied to seismic horizon interpretations. International
Congress of Petroleum Geostatistics, Cascais, pp 10–14
Templ M, Filzmoser P, Reimann C (2008) Cluster analysis applied to
regional geochemical data: problems and possibilities. Appl
Geochem 23(8):2198–2213
Tipping ME, Bishop CM (1999) Probabilistic principal component
analysis. J R Stat Soc B 61:611–622
Xiaoya Z, Chunlai L, Chang L (2009) Quantification of the chemical
composition of lunar soil in terms of its reflectance spectra by
PCA and SVM. Acta Geochim 28:204–211
Yeung KY, Ruzzo WL (2001) Principal component analysis for
clustering gene expression data. Bioinformatics 17:763–774
Zha H, He X, Ding C, Simon H, Gu M (2001) Spectral relaxation for
K-means clustering. Technical Report TR-2001-XX, Pennsylva-
nia State University, University Park, PA