Applying Improved Clustering Algorithm into EC Environment Data Mining
Yupeng Ma1,a, Bo Ma1,b and Tonghai Jiang1,c
1 Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, Xinjiang Province, China
Keywords: EC environment, Customer segmentation, K-means, improved K-means.
Abstract. With the rapid growth of electronic commerce (EC), EC service providers are keen to analyze the online browsing behavior of the customers on their web sites and learn their specific features. Clustering is a popular undirected data mining technique for partitioning a dataset into a set of clusters. Although there are many clustering algorithms, none is clearly superior for the task of customer segmentation, which suggests that a clustering algorithm tailored to the EC environment is needed. In this paper we address this situation and propose an improved K-means algorithm that effectively excludes noisy data and improves clustering accuracy. Experimental results obtained in a real EC environment demonstrate the effectiveness and feasibility of the proposed approach.
Introduction
In the electronic commerce environment, Customer Relationship Management (CRM) can be defined as the process that manages and supports the interactions between a company and its customers. A general CRM framework includes an analytical component, customer-focused tools used to analyze existing customer data; an operational component, functional tools used to achieve customer-centric goals; and a customer contact component, the channels or media used to interact with customers directly. CRM helps locate relevant information about customers and use it to market products to vital customer segments, and to obtain feedback on how well customer needs were satisfied [1].
CRM encompasses all activities carried out with the customer in focus, from both the business and the customer point of view. Along with information systems automation, sales force and marketing automation, and infrastructure development, an important element of CRM is analytics.
Analytics involves the use of several scientific techniques to analyze the available customer data and derive predictive conclusions from it that then help make business decisions. The biggest hurdle faced in analytics of this nature is not the lack of information but rather its abundance, and the failure to use all of this data to derive something meaningful. As technological improvements have been made to CRM tools, many traditional statistical tools have been incorporated into such systems to assist in carrying out analyses of this nature. Current options allow more than one tool to be used at a time, so as to arrive at the solution most suitable for our needs. Thus, integrating useful predictive tools is vital for CRM to derive insight from existing information.
Applied Mechanics and Materials Vol. 596 (2014) pp 951-959. Submitted: 07.05.2014; Accepted: 13.05.2014; online since 2014/Jul/18 at www.scientific.net. © (2014) Trans Tech Publications, Switzerland. doi:10.4028/www.scientific.net/AMM.596.951
The collective use of statistical tools to extract customer information, analyze it, and infer customer behavior patterns from it is termed data mining. In all recently developed systems, data mining techniques are an integral part of CRM. By definition, data mining refers to extracting or mining knowledge from large amounts of data.
Web Data Mining
The rise of data mining technology has solved the problem of complex customer segmentation. Data mining is the process of extracting potentially useful information, patterns of interest, and hidden relationships from massive, fuzzy, random data, which allows people to view the data from a higher level and mine its latent value to support decision-making. Data mining technologies can help companies find potential customers' information in massive data, transform the customer information in the database into useful characteristic information, and cluster similar customers, thus identifying valuable customers and providing different services to different customer groups.
Web Mining. Web mining can be broadly categorized into Web Content Mining [2], Web Structure Mining, and Web Usage Mining, as shown in Fig. 1.
(1) Web Content Mining deals with the discovery of multimedia documents, involving text, hypertext, image, audio and video information, and their automatic categorization.
(2) Web Structure Mining deals with the discovery of inter-document links, represented as a graph of links within a site or between sites.
(3) Web Usage Mining [3] deals with the discovery and analysis of "interesting" patterns from click-stream and associated data collected during interactions with a Web server on one or more Web sites.
Fig. 1 The web mining taxonomy
Cluster analysis is one of the most commonly used data mining techniques. Based on a particular similarity criterion, a clustering algorithm splits the sample space into a number of different sub-spaces such that sample points within each sub-space are similar to each other, while sample points from different sub-spaces are not. Its essence is to find the hidden data models through an unsupervised learning process. Using clustering algorithms for customer segmentation means grouping similar customers into the same sub-space while maximizing the differences between different types of customers.
Clustering algorithm
Data mining can automatically extract potentially valuable knowledge, models, or rules from large amounts of data; it belongs to the family of discovery technologies. It can help decision-makers find regularities and key factors and predict trends. Clustering is a popular data mining technique that attempts to partition a dataset into a meaningful set of mutually exclusive clusters.
Data types and data structures in cluster analysis.
(1) Interval-scaled variables
An interval-scaled variable is a roughly linear scale for continuous values. Since the data are divided into different categories, we define a difference-degree function, used to measure the similarity of data within the same class or the dissimilarity between different classes. Because different data samples may have multiple attributes [4], using different measurement methods for different attributes would produce very different clustering results, so the data must first be standardized.
$$z_{ip} = \frac{x_{ip} - m_p}{S_p} \qquad (1)$$

where the mean absolute deviation $S_p$ is

$$S_p = \frac{1}{n}\sum_{i=1}^{n} |x_{ip} - m_p| \qquad (2)$$

and $m_p$ is the mean value of attribute $p$:

$$m_p = \frac{1}{n}\sum_{i=1}^{n} x_{ip} \qquad (3)$$
In the presence of noise points, the mean absolute deviation is more robust than the standard deviation. After standardizing the data, similarity can be measured by calculating the distance between sample points.
The following distance measures are commonly used.
Euclidean distance:

$$D(x_i, x_j) = \|x_i - x_j\| = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \qquad (4)$$

Manhattan distance:

$$D(x_i, x_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}| \qquad (5)$$

Minkowski distance:

$$D_m(x_i, x_j) = \left(\sum_{k=1}^{p} |x_{ik} - x_{jk}|^m\right)^{1/m} \qquad (6)$$
(2) Binary variables
A binary variable has two states, 0 and 1. Binary variables can be divided into symmetric and asymmetric binary variables according to their state values. For a symmetric binary variable the two states carry the same value and the same weight; for an asymmetric binary variable they do not. Assume the binary values of two objects are tabulated as follows:
```
                    Object j
                    1        0       sum
Object i    1       q        r       q+r
            0       s        t       s+t
          sum      q+s      r+t       p
```
The simple matching coefficient evaluates the degree of difference as:

$$d(i,j) = \frac{r+s}{q+r+s+t} \qquad (7)$$

The Jaccard coefficient evaluates the degree of difference as:

$$d(i,j) = \frac{r+s}{q+r+s} \qquad (8)$$
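Eqs. (7) and (8) can be computed by counting the four contingency cells directly; an illustrative sketch, with the cell names following the table above:

```python
def binary_dissimilarity(i, j):
    """Simple matching (Eq. 7) and Jaccard (Eq. 8) dissimilarity of two 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # i is 1, j is 0
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # i is 0, j is 1
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # both 0
    simple = (r + s) / (q + r + s + t)   # symmetric attributes, Eq. (7)
    jaccard = (r + s) / (q + r + s)      # asymmetric attributes, Eq. (8)
    return simple, jaccard
```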
(3) Nominal and ordinal variables
A nominal variable is a generalization of a binary variable: unlike a binary variable, it can take many state values. The states are not comparable in size and have no intrinsic order. The degree of difference between objects on nominal variables can be measured by a simple matching coefficient:

$$d(i,j) = \frac{p-m}{p} \qquad (9)$$

where p is the number of attributes and m is the number of attributes that match.
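Eq. (9) is just the fraction of mismatched attributes; a short sketch with our own names:

```python
def nominal_dissimilarity(i, j):
    """Eq. (9): fraction of nominal attributes that do not match."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

d = nominal_dissimilarity(["red", "round"], ["red", "square"])   # -> 0.5
```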
Ordinal variables resemble nominal variables and can be divided into discrete and continuous. A discrete ordinal variable is similar to a nominal variable, the only difference being that its states have a meaningful order. A continuous ordinal variable is similar to an interval-scaled variable, the difference being that it has no units: the actual magnitude of the value is not meaningful, only the order of the values. Ranks are mapped onto the unit interval by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1} \qquad (10)$$

where $r_{if}$ is the rank of object i on variable f and $M_f$ is the number of ranks.
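The rank mapping of Eq. (10) is a one-liner; an illustrative sketch:

```python
def rank_normalize(ranks, m_f):
    """Eq. (10): map ranks 1..m_f onto the unit interval [0, 1]."""
    return [(r - 1) / (m_f - 1) for r in ranks]

z = rank_normalize([1, 2, 3], 3)   # -> [0.0, 0.5, 1.0]
```

After this mapping, ordinal variables can be treated exactly like interval-scaled ones.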
(4) Ratio-scaled variables
A ratio-scaled variable is a positive measurement on a non-linear scale. Its degree of difference can be computed in three ways: treat it directly with the interval-scaled method; apply a logarithmic transformation and then use the interval-scaled method; or treat it as a continuous ordinal variable and then use the interval-scaled method.
(5) Mixed variables
In general, a real database contains all four data types described above. One method is to group the database variables by type and perform a separate cluster analysis for each type; if the results obtained for each type are mutually compatible, this method is feasible. Another method is to combine all variable types into a single dissimilarity matrix and cluster only once:

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \qquad (11)$$

where $d_{ij}^{(f)}$ is the dissimilarity between objects i and j on variable f, and $\delta_{ij}^{(f)}$ is 0 when variable f is missing for object i or j and 1 otherwise.
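Eq. (11) is a weighted average of per-attribute dissimilarities; a minimal sketch with our own names:

```python
def mixed_dissimilarity(d_f, delta_f):
    """Eq. (11): combine per-attribute dissimilarities into one value.

    d_f[f]     -- dissimilarity of the two objects on attribute f
    delta_f[f] -- 0 if attribute f is missing for either object, else 1
    """
    return sum(d * w for d, w in zip(d_f, delta_f)) / sum(delta_f)

# Third attribute missing, so only the first two contribute:
d = mixed_dissimilarity([0.2, 0.8, 0.5], [1, 1, 0])   # -> 0.5
```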
Data structures. The collected data are preprocessed before cluster analysis. Cluster analysis uses the following two structures [5].
(1) Data matrix
The data matrix can be seen as a two-dimensional m×n matrix whose rows and columns represent different entities (m objects and n attributes); the general form of this data structure is a relational table. Before cluster analysis it is converted to the dissimilarity matrix.

$$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$
(2) Dissimilarity matrix
The dissimilarity matrix is an n×n matrix describing the pairwise dissimilarities between n objects. Its rows and columns represent the same entities, and every element has the same dimension. Most clustering algorithms use this data structure.

$$\begin{pmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$
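The conversion from data matrix to dissimilarity matrix can be sketched as follows; an illustrative example that assumes Euclidean distance as the dissimilarity:

```python
import math

def dissimilarity_matrix(data):
    """Turn an m x n data matrix into the symmetric m x m dissimilarity matrix."""
    m = len(data)
    d = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i):          # fill the lower triangle, mirror to the upper
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist
    return d
```

Only the lower triangle needs to be computed, since d(i, j) = d(j, i) and the diagonal is zero.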
Cluster Analysis. Numerous algorithms are available for clustering. Cluster analysis techniques can generally be divided into five categories: partitioning-based, hierarchical, density-based, grid-based, and model-based clustering. Typically these clustering algorithms, while providing summary statistics on the generated set of clusters (e.g. the mean of each variable, the distance between clusters) [6], do not provide easily interpretable, detailed descriptions of the clusters that are generated. Further, for a given dataset, different algorithms may give different sets of clusters, so it is never clear which algorithm and which parameter settings (e.g. number of clusters) are the most appropriate [7]. Each category includes many specific algorithms; the relevant ones are shown in Fig. 2.
[Figure: taxonomy of clustering algorithms. Partitioning-based algorithms (K-Means, K-Medoids); hierarchical-based algorithms (agglomerative hierarchical, split hierarchical); density-based algorithms (DBSCAN, OPTICS); grid-based algorithms (STING, CLIQUE, DENCLUE); model-based algorithms (statistical, neural network, and swarm algorithms).]
Fig. 2 The clustering algorithm taxonomy
This research used an improved K-means algorithm.
(1) K-means
In K-means clustering, clusters are formed around centroids: each data point is assigned to a group according to its nearness to the centroids, and the process is repeated until the centroids stop shifting. The main objective of K-means clustering is to minimize the distance between data points within a cluster; in other words, the objective is to minimize the total squared error [8], the sum over all clusters of the squared distances between each data point $x_i$ and its centroid $c_j$. Fig. 3 illustrates the steps involved in K-means clustering:
Fig. 3 K-means clustering steps
The algorithm is as follows:
a) Given a data set D containing n objects, the number of clusters k, and initial cluster centers $Z_j(I),\ j = 1, 2, \ldots, k$.
b) Calculate the distance from each data object to each cluster center, $D(x_i, Z_j(I)),\ i = 1, \ldots, n;\ j = 1, \ldots, k$, and find the center satisfying

$$D(x_i, Z_w(I)) = \min\{D(x_i, Z_j(I)),\ j = 1, 2, \ldots, k\},\quad x_i \in w \qquad (12)$$

In accordance with this minimum distance, each object is assigned to its closest cluster.
c) Recalculate the mean of each cluster to identify the new cluster centers, and calculate the squared-error criterion function

$$J_c(I) = \sum_{j=1}^{k} \sum_{t=1}^{n_j} \left\| x_t^{(j)} - Z_j(I) \right\|^2 \qquad (13)$$

d) Judge: if $|J_c(I) - J_c(I-1)| < \xi$, the algorithm ends; otherwise calculate the k new cluster centers

$$Z_j(I) = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)},\quad j = 1, 2, \ldots, k$$

and return to step b).
e) Output the k cluster collections.
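Steps a)-e) can be sketched as a short routine. This is a minimal illustration under our own assumptions (squared Euclidean distance, random initial centers), not the authors' implementation:

```python
import random

def kmeans(data, k, xi=1e-6, seed=0):
    """Minimal K-means: returns (clusters, centers) once J_c stops changing."""
    random.seed(seed)
    centers = random.sample(data, k)            # step a): initial cluster centers
    prev_j = float("inf")
    while True:
        # step b): assign each point to its nearest center (Eq. 12)
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
            clusters[j].append(x)
        # step c): recompute centers as cluster means, then the criterion J_c (Eq. 13)
        for j, members in enumerate(clusters):
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        j_c = sum(sum((a - b) ** 2 for a, b in zip(x, centers[j]))
                  for j, members in enumerate(clusters) for x in members)
        # step d): stop when the criterion change falls below xi
        if abs(prev_j - j_c) < xi:
            return clusters, centers             # step e): output the k clusters
        prev_j = j_c
```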
The K-means algorithm has several drawbacks: the number of clusters must be given in advance, and for users who lack domain experience it is difficult to find the most appropriate clustering; the algorithm is highly dependent on the initial cluster centers; its time complexity is high; and for categorical attributes the clustering results are unsatisfactory [9]. This paper therefore proposes an improved K-means algorithm.
(2) Improved K-means algorithm
The improved K-means algorithm [10] is as follows:
a) Input the data set and initialize the parameters.
b) Run the iterative process and obtain the K clustering results.
Test whether the clustering result converges; if it converges, calculate Sil(K) and record it, where Sil is the silhouette indicator

$$Sil(t) = \frac{b(t) - e(t)}{\max\{e(t), b(t)\}} \qquad (14)$$

in which $b(t) = \min\{d(t, C_i)\},\ i = 1, \ldots, k,\ i \neq j$.
Test whether the cluster centers meet the convergence criterion; if so, obtain the K clusters and calculate $Sil_{\max}$.
Find the optimal cluster number corresponding to $Sil_{\max}$ and calculate the Hartigan indicator

$$Ha(k) = \left(\frac{trSW(k)}{trSW(k+1)} - 1\right)(n - k - 1) \qquad (15)$$

c) Initialize the K-means algorithm with the selected number of clusters K and the cluster centers, and run K-means to obtain the final clustering results.
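The silhouette indicator of Eq. (14) can be computed for a whole partition as follows. This is a sketch under the assumption of Euclidean distance, with our own function names: e(t) is the mean distance from a point to the rest of its own cluster, and b(t) the smallest mean distance to any other cluster.

```python
import math

def silhouette(data, labels):
    """Mean silhouette score of a labeled partition (Eq. 14, averaged)."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    ks = set(labels)
    scores = []
    for i, x in enumerate(data):
        # e(t): mean distance to the other members of x's own cluster
        own = [dist(x, data[j]) for j, l in enumerate(labels)
               if l == labels[i] and j != i]
        e_t = sum(own) / len(own) if own else 0.0
        # b(t): smallest mean distance to any other cluster
        b_t = min(
            sum(dist(x, data[j]) for j, l in enumerate(labels) if l == c)
            / sum(1 for l in labels if l == c)
            for c in ks if c != labels[i]
        )
        scores.append((b_t - e_t) / max(e_t, b_t))
    return sum(scores) / len(scores)
```

Running this for each candidate K and keeping the K with the largest mean silhouette is one way to realize the model selection in step b).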
Experiment and Analysis
We usually start from web site visitor logs or CRM information, preprocess the data, establish the relevant model, and use a clustering method to segment customers, providing a basis for enterprise decision-making. The data used in this paper come from an e-commerce site and include a customer information table, a merchandise information table, customer order tables, and others, so the data attributes are numerous and complex.
Customer segmentation based on K-means. Setting K = 4, the K-means customer segmentation results are as follows:
Table 1 K-means clustering

Clustering category | Cluster centroid   | Number of samples
1                   | (-1.1890, 1.6119)  | 19
2                   | (-0.6786, 0.1418)  | 30
3                   | (0.3360, -0.2445)  | 22
4                   | (1.2261, -1.0173)  | 29
After preprocessing and standardizing the selected sample data, the algorithm is applied to segment the customers, as follows:
Fig. 4 K-means clustering
Customer segmentation based on improved K-means. The improved K-means customer segmentation results are as follows:

Table 2 Improved K-means clustering

Clustering category | Cluster centroid   | Number of samples | Number of noise data
1                   | (-1.1890, 1.6119)  | 12                | 3
2                   | (-0.6269, 0.1806)  | 26                |
3                   | (0.1798, -0.4982)  | 20                |
4                   | (1.2261, -1.0173)  | 29                |
After preprocessing and standardizing the selected sample data, the improved algorithm is applied to segment the customers, as follows:
Fig. 5 Improved K-means clustering
We can see that after the data are standardized, the improved K-means algorithm also excludes the three noise points and obtains better clustering results. Therefore, the improved K-means clustering algorithm outperforms the standard K-means clustering algorithm.
Combined with the tables above, we can draw the following conclusions:
1) The fourth class of customers is the largest. These customers are characterized by less frequent consumption and a very low average consumption amount. Combined with customer-related information, this class is less educated, with lower income and an uneven distribution of age and location.
2) The third class is the smallest, with the lowest average consumption frequency but a very high average consumption amount. These customers are well educated, with high income, mostly gathered in Beijing, Shanghai, Guangzhou and other major cities, and aged between 25 and 35.
3) The second class consumes more frequently, with an average consumption amount. They are about 30 years old, with average education and income, mostly from second- and third-tier cities.
4) The first class consumes less frequently than the second class but with a very high consumption amount. They are mostly highly educated, with high income, and generally aged 35 to 45.
Acknowledgment
Our thanks to Xi Zhou, Lei Wang, and all the members of the Research Center for Multilingual Information Technology. This work is funded by the West Light Foundation of the Chinese Academy of Sciences (No. XBBS201313) and the National High Technology Research and Development Program of China (No. 2013AA01A607).
References
[1] A. Barak and R. Gelbard, "Classification by clustering decision tree-like classifier based on adjusted clusters," Expert Systems with Applications, vol. 38, no. 7, 2011, pp. 8220-8228.
[2] J. Srivastava, R. Cooley, M. Deshpande and P.N. Tan, "Web usage mining: discovery and applications of usage patterns from Web data," ACM SIGKDD Explorations Newsletter, vol. 1, 2000, pp. 12-23. DOI: 10.1145/846183.846188
[3] Jianxi Zhang, Peiying Zhao, Lin Shang and Lunsheng Wang, "Web Usage Mining Based on Fuzzy Clustering in Identifying Target Group," International Colloquium on Computing, Communication, Control, and Management, vol. 4, 2009, pp. 209-212.
[4] R. Dubes, "Cluster analysis and related issues," in: C. Chen, L. Pau, P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing Co. Inc., River Edge, NJ, 1993, pp. 3-32.
[5] D. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning, vol. 2, 1987, pp. 139-172.
[6] L. Wallace, M. Keil and A. Rai, "Understanding software project risk: a cluster analysis," Information and Management, vol. 42, 2004, pp. 115-155.
[7] Innovations in Intelligent Machines, Springer, Heidelberg, 2011, pp. 1-19.
[8] S.K. Patnaik, S. Sahoo and D.K. Swain, "Clustering of Categorical Data by Assigning Rank through Statistical Approach," International Journal of Computer Applications, vol. 43, no. 2, 2012, pp. 1-3.
[9] M. Verma, M. Srivastava, N. Chack, A.K. Diswar and N. Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), vol. 2, no. 3, 2012, pp. 1379-1384.
[10] A.K. Jain and S. Maheswari, "Survey of Recent Clustering Techniques in Data Mining," International Journal of Computer Science and Management Research, vol. 1, no. 1, 2013, pp. 72-78.