
Applying Improved Clustering Algorithm into EC Environment Data Mining

Yupeng Ma 1,a, Bo Ma 1,b and Tonghai Jiang 1,c

1 Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, Xinjiang Province, China

a [email protected], b [email protected], c [email protected]

Keywords: EC environment, Customer segmentation, K-means, improved K-means.

Abstract. With the rapid growth in the number of electronic commerce (EC) customers, EC service providers are keen to analyze the online browsing behavior of customers on their web sites and to learn their specific features. Clustering is a popular undirected-learning data mining technique that partitions a dataset into a set of clusters. Although many clustering algorithms exist, none is clearly superior for the task of customer segmentation, which suggests that a clustering algorithm suited to the EC environment should be developed. This paper addresses that situation and proposes an improved K-means algorithm that excludes noisy data and improves clustering accuracy. Experimental results on data from a real EC environment demonstrate the effectiveness and feasibility of the proposed approach.

Introduction

In an electronic commerce environment, Customer Relationship Management (CRM) can be defined as the process that manages and supports the interactions between a company and its customers. A general framework for a CRM system includes three kinds of components: analytical components, which are customer focused and used to analyze existing customer data; operational components, which comprise the functional tools used to achieve customer-centric goals; and customer contact components, which comprise the channels or media used to interact with the customer directly. CRM helps locate relevant information about customers, use it to market products to vital customer segments, and obtain feedback on how well customer needs were satisfied [1].

CRM encompasses all activities carried out with the customer in focus, from both the business and the customer point of view. Along with information systems automation, sales force and marketing system automation, and infrastructure development, an important element of CRM is analytics.

Analytics involves the use of several scientific techniques to analyze the available customer data and to derive predictive conclusions from it that help in making business decisions. The biggest hurdle in analytics of this nature is not a lack of information but its abundance, together with the failure to use all of this data to derive something meaningful. As CRM tools have improved technologically, many traditional statistical tools have been incorporated into them to assist with this kind of analysis. Current options allow more than one tool to be used at a time to arrive at the solution best suited to the need at hand. Integrating useful predictive tools is therefore vital for CRM to derive new information from existing information.

Applied Mechanics and Materials Vol. 596 (2014) pp 951-959. Submitted: 07.05.2014. Accepted: 13.05.2014. Online available since 2014/Jul/18 at www.scientific.net. © (2014) Trans Tech Publications, Switzerland. doi:10.4028/www.scientific.net/AMM.596.951

All rights reserved. No part of contents of this paper may be reproduced or transmitted in any form or by any means without the written permission of TTP, www.ttp.net.


The collective use of statistical tools to extract customer information, analyze it, and infer customer behavior patterns from it is termed data mining. In recently developed systems, data mining techniques are an integral part of CRM. By definition, data mining refers to extracting, or mining, knowledge from large amounts of data.

Web Data Mining

The rise of data mining technology has helped solve the problem of complex customer segmentation. Data mining is the process of extracting potentially useful information, such as patterns of interest, hidden relationships, and variations, from massive, fuzzy, random data. It allows people to understand data at a higher level and to mine its potential value in support of decision-making. Data mining technologies can help companies find information on potential customers in massive data, transform the customer information in a database into useful characteristic information, and cluster similar customers, so that valuable customers can be identified and different services can be provided to different customers.

Web Mining. Web mining can be broadly categorized into Web Content Mining [2], Web Structure Mining, and Web Usage Mining, as shown in Fig. 1.

(1) Web Content Mining deals with the discovery of multimedia documents, involving texts,

hypertexts, images, audio and video information and their automatic categorization.

(2) Web Structure Mining deals with the finding of inter-document links, provided as a graph of

links in a site or between sites.

(3) Web Usage Mining [3] deals with the discovery and analysis of “interesting” patterns from

click-stream and associated data collected during the interactions with Web server on one or more

Web sites.

Fig. 1 The web mining taxonomy

Cluster analysis is one of the most commonly used data mining techniques. Based on a particular similarity criterion, a clustering algorithm splits the sample space into several sub-spaces such that sample points within a sub-space are similar to each other, while sample points from different sub-spaces are not. In essence, it discovers hidden structure in the data through an unsupervised learning process. Applying clustering to customer segmentation means grouping similar customers into the same sub-space while maximizing the differences between different types of customers.


Clustering algorithm

Data mining can automatically extract potentially valuable knowledge, models, or rules from large volumes of data; it belongs to the family of knowledge discovery technologies. It can help decision-makers find regularities and key factors and predict trends. Clustering is a popular data mining technique that attempts to partition a dataset into a meaningful set of mutually exclusive clusters.

Data types and data structures in cluster analysis.

(1) Interval-scaled variable

An interval-scaled variable is a continuous variable measured on a roughly linear scale. Because clustering divides the data into categories, we define a dissimilarity function that measures the similarity of data within the same class and the dissimilarity between different classes. Data samples may have multiple attributes [4], and measuring different attributes in different ways can lead to very different clustering results, so the data must be standardized first.

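$$Z_{ip} = \frac{x_{ip} - m_p}{S_p} \tag{1}$$

The average absolute error $S_p$ is

$$S_p = \frac{1}{n}\sum_{i=1}^{n} \left| x_{ip} - m_p \right| \tag{2}$$

and $m_p$ is the average value of attribute $p$,

$$m_p = \frac{1}{n}\sum_{i=1}^{n} x_{ip} \tag{3}$$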
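Eqs. (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `standardize` and the sample values are hypothetical, and a constant attribute is simply rejected because Eq. (1) would divide by zero.

```python
def standardize(values):
    """Standardize one attribute with the average absolute error, Eqs. (1)-(3)."""
    n = len(values)
    m_p = sum(values) / n                          # Eq. (3): attribute mean
    s_p = sum(abs(x - m_p) for x in values) / n    # Eq. (2): average absolute error
    if s_p == 0:
        raise ValueError("attribute is constant; cannot standardize")
    return [(x - m_p) / s_p for x in values]       # Eq. (1): standardized value


if __name__ == "__main__":
    # Hypothetical order amounts for five customers; 400.0 plays the role of a noise point.
    print(standardize([10.0, 20.0, 30.0, 40.0, 400.0]))
```

Because deviations are not squared in Eq. (2), a single extreme value inflates the scale less than it would inflate a standard deviation.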

When noise points are present, the mean absolute error is more robust than the standard deviation. After the data has been standardized, similarity can be measured by calculating the distance between sample points.

The following distance measures are commonly used.

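Euclidean distance:

$$D(x_i, x_j) = \left\| x_i - x_j \right\| = \sqrt{\sum_{k=1}^{p} \left( x_{ik} - x_{jk} \right)^2} \tag{4}$$

Manhattan distance:

$$D(x_i, x_j) = \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right| \tag{5}$$

Minkowski distance:

$$D_m(x_i, x_j) = \left( \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|^m \right)^{\frac{1}{m}} \tag{6}$$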
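The three distances in Eqs. (4) to (6) differ only in the exponent m, which the following sketch makes explicit. The function names are illustrative choices, and the two sample points are the first and fourth cluster centroids reported later in Table 1, used here only as convenient inputs.

```python
def minkowski(x, y, m):
    """Minkowski distance of order m between two equal-length points, Eq. (6)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def manhattan(x, y):
    """Manhattan distance, Eq. (5): the Minkowski distance with m = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Euclidean distance, Eq. (4): the Minkowski distance with m = 2."""
    return minkowski(x, y, 2)


if __name__ == "__main__":
    c1 = (-1.1890, 1.6119)   # centroid of cluster 1 in Table 1
    c4 = (1.2261, -1.0173)   # centroid of cluster 4 in Table 1
    print(euclidean(c1, c4), manhattan(c1, c4), minkowski(c1, c4, 3))
```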

(2) Binary variable

A binary variable has only two states, 0 and 1. Binary variables can be divided into symmetric and asymmetric binary variables according to their state values. For a symmetric binary variable, the two states are equally valuable and carry the same weight; for an asymmetric binary variable, they do not. Assume that the states of the binary variables for two objects are tabulated as follows:


| Object i \ Object j | 1 | 0 | sum |
|---|---|---|---|
| 1 | a | b | a+b |
| 0 | c | d | c+d |
| sum | a+c | b+d | p |

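The simple matching coefficient evaluates the degree of difference as follows:

$$d(i, j) = \frac{r + s}{q + r + s + t} \tag{7}$$

The Jaccard coefficient evaluates the degree of difference as follows:

$$d(i, j) = \frac{r + s}{q + r + s} \tag{8}$$

Here q, r, s, and t denote the counts labeled a, b, c, and d in the table above: q is the number of variables equal to 1 for both objects, r and s are the numbers of variables on which the two objects differ, and t is the number of variables equal to 0 for both objects.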
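Eqs. (7) and (8) can be illustrated with a short sketch that first counts q, r, s, and t for two binary vectors. The function names and the sample vectors are hypothetical, and returning 0.0 when q + r + s is zero (two all-zero vectors) is an assumption of this sketch that the paper does not discuss.

```python
def binary_counts(x, y):
    """Count the four contingency cells q, r, s, t for two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 for i, 0 for j
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 0 for i, 1 for j
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    return q, r, s, t


def simple_matching_dissimilarity(x, y):
    """Eq. (7): suited to symmetric binary variables."""
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)


def jaccard_dissimilarity(x, y):
    """Eq. (8): ignores 0/0 matches, suited to asymmetric binary variables."""
    q, r, s, _ = binary_counts(x, y)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return (r + s) / (q + r + s)


if __name__ == "__main__":
    # Hypothetical flags, e.g. whether each of five product pages was visited.
    i = [1, 0, 1, 0, 0]
    j = [1, 1, 0, 0, 0]
    print(simple_matching_dissimilarity(i, j), jaccard_dissimilarity(i, j))
```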

(3) Nominal and ordinal variables

A nominal variable is a generalization of the binary variable: it can take more than two state values. The state values have no meaningful magnitude and no inherent order. The degree of difference between objects described by nominal variables can be computed with a simple matching coefficient.

$$d(i, j) = \frac{p - m}{p} \tag{9}$$

where p is the total number of attributes and m is the number of attributes on which the two objects match.
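As a small worked example of Eq. (9), the sketch below compares two customers described by nominal attributes. The function name and the attribute values are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Eq. (9): share of nominal attributes on which two objects differ."""
    if len(x) != len(y) or not x:
        raise ValueError("objects need the same, non-zero number of attributes")
    p = len(x)                                   # total number of attributes
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matching attributes
    return (p - m) / p


if __name__ == "__main__":
    # Hypothetical attributes: membership level, city, payment method.
    customer_i = ("gold", "Beijing", "card")
    customer_j = ("gold", "Shanghai", "card")
    print(nominal_dissimilarity(customer_i, customer_j))  # one mismatch out of three
```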

Ordinal variables resemble nominal variables and can be either discrete or continuous. A discrete ordinal variable is similar to a nominal variable, except that the order of its states is meaningful. A continuous ordinal variable is similar to an interval-scaled variable, except that it has no units: the actual magnitude of a value is not meaningful, but the relative order of the values is.

$$Z_{if} = \frac{r_{if} - 1}{m_f - 1} \tag{10}$$
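The paper does not define the symbols in Eq. (10). The sketch below assumes the usual reading, in which r_if is the 1-based rank of object i's state on variable f and m_f is the number of ordered states of f, so that ranks are mapped onto [0, 1] and can then be treated like interval-scaled values. The function name and the education levels are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Eq. (10) under the assumed reading: map a rank onto the interval [0, 1]."""
    m_f = len(ordered_states)                  # assumed: number of ordered states
    if m_f < 2:
        raise ValueError("an ordinal variable needs at least two states")
    r_if = ordered_states.index(value) + 1     # assumed: 1-based rank of the state
    return (r_if - 1) / (m_f - 1)


if __name__ == "__main__":
    levels = ["primary", "secondary", "bachelor", "master", "doctorate"]
    print([ordinal_to_unit_interval(v, levels) for v in levels])
```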

(4) Ratio-scaled variable

A ratio-scaled (proportional) variable is a positive measurement on a non-linear scale. Its degree of difference can be calculated in three ways: apply the interval-scaled method directly; apply a logarithmic transformation and then use the interval-scaled method; or treat the values as continuous ordinal data and then use the interval-scaled method.

(5) Mixed variables

In general, a real database contains all four data types described above. One approach is to group the variables by type and perform a separate cluster analysis for each type; this is feasible if the results obtained for the different types are mutually compatible. Another approach is to combine all variable types into a single dissimilarity matrix and cluster only once:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \tag{11}$$
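Eq. (11) is a weighted average of per-variable dissimilarities. The paper does not spell out the indicator δ or the per-variable term d, so the sketch below adopts a common convention as an explicit assumption: δ is 0 when either value is missing (`None`) and 1 otherwise, numeric variables contribute the absolute difference divided by the variable's range, and nominal variables contribute 0 on a match and 1 otherwise. All names and sample values are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Eq. (11) for objects with numeric and nominal attributes.

    kinds[f] is "numeric" or "nominal"; ranges[f] is the value range
    (max - min) of a numeric attribute and is ignored for nominal ones.
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue                                # assumed: delta = 0 for missing values
        if kind == "numeric":
            d_f = abs(x[f] - y[f]) / ranges[f]      # assumed: range-normalized difference
        elif kind == "nominal":
            d_f = 0.0 if x[f] == y[f] else 1.0      # assumed: simple match / mismatch
        else:
            raise ValueError("unknown variable kind: " + str(kind))
        numerator += d_f                            # delta = 1 for comparable values
        denominator += 1.0
    if denominator == 0:
        raise ValueError("the two objects share no comparable attribute")
    return numerator / denominator


if __name__ == "__main__":
    kinds = ("numeric", "nominal", "numeric")
    ranges = (50.0, None, 10.0)
    print(mixed_dissimilarity((30, "Beijing", None), (40, "Shanghai", 5.0), kinds, ranges))
```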


Data structure. The collected data must be preprocessed before cluster analysis. Cluster analysis uses the following two data structures [5].

(1) Data matrix

The data matrix can be seen as a two-dimensional m × n matrix whose rows and columns represent different entities; this data structure generally takes the form of a relational table. Before cluster analysis, it is converted to a dissimilarity matrix.

$$\begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}$$

(2) Dissimilarity matrix

The dissimilarity matrix is an n × n matrix that stores the pairwise dissimilarities between n objects. Its rows and columns represent the same entities, so every row and every column has the same dimension. Most clustering algorithms operate on this data structure.

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
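Converting a data matrix into the lower-triangular dissimilarity matrix shown above can be sketched as follows. The choice of Euclidean distance (via the standard library's `math.dist`, available from Python 3.8) is an assumption for illustration; any of the dissimilarity measures described earlier could be substituted. The function name and sample rows are hypothetical.

```python
import math


def dissimilarity_matrix(data):
    """Return the lower triangle of the dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i); the last entry of each row is the
    zero on the diagonal. Indices are 0-based here, 1-based in the text.
    """
    return [[math.dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


if __name__ == "__main__":
    rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # three objects, two attributes
    for row in dissimilarity_matrix(rows):
        print(row)
```

Only the lower triangle is stored because the matrix is symmetric, which halves the memory needed for large customer sets.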

Cluster Analysis. Numerous clustering algorithms are available. Cluster analysis techniques can generally be divided into five categories: partitioning-based, hierarchical, density-based, grid-based, and model-based clustering. Typically, these algorithms provide summary statistics on the generated set of clusters (e.g. mean of each variable, distance between clusters) [6] but do not provide easily interpretable, detailed descriptions of the clusters. Further, for a given dataset, different algorithms may give different sets of clusters, so it is never clear which algorithm and which parameter settings (e.g. number of clusters) are the most appropriate [7]. Each category includes many specific algorithms; the relevant clustering algorithms are shown below.

[Figure: clustering algorithm categories (model-based, partitioning-based, density-based, hierarchical-based, grid-based) with example algorithms: statistical, neural network, and swarm algorithms; K-Means and K-Medoids; DBSCAN and OPTICS; agglomerative and split hierarchical algorithms; STING, CLIQUE, and DENCLUE.]

Fig. 2 The clustering algorithm taxonomy

This research used an improved K-means algorithm.

(1) K-means

In K-means clustering, clusters are formed around cluster centroids. Each data point is assigned to a group according to its nearness to the centroids, and the process is repeated until the centroids stop shifting. The main objective of K-means clustering is to minimize the distance between the data points within a cluster; in other words, the objective is to minimize the total squared error [8].


Here, the error function is the distance between the data point Xi and the centroid Cj. Fig. 3 illustrates the steps involved in K-means clustering:

Fig. 3 K-means clustering steps

The algorithm is as follows:

a) Given a data set D containing n data objects, the number of clusters k, and the initial cluster centers $Z_j(I)$, $j = 1, 2, \ldots, k$, where I is the iteration index.

b) Calculate the distance from each data object to each cluster center, $D(x_i, Z_j(I))$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, k$. If

$$D(x_i, Z_k(I)) = \min\{ D(x_i, Z_j(I)),\ j = 1, 2, \ldots, k \}, \quad \text{then } x_i \in w_k \tag{12}$$

That is, in accordance with the minimum distance, each object is assigned to the closest cluster.

c) Recalculate the mean of each cluster to identify the new cluster centers, and calculate the squared error criterion function

$$J_c(I) = \sum_{j=1}^{k} \sum_{k=1}^{n_j} \left\| x_k^{(j)} - Z_j(I) \right\|^2 \tag{13}$$

where $n_j$ is the number of objects in cluster j and $x_k^{(j)}$ is the k-th object of that cluster.

d) Judge: if $\left| J_c(I) - J_c(I-1) \right| < \xi$, the algorithm ends. Otherwise, calculate the k new cluster centers

$$Z_j(I+1) = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}, \quad j = 1, 2, \ldots, k$$

increment I, and return to step b).

e) Output the k clusters.
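Steps a) to e) can be sketched as a short, dependency-free Python function. This is a minimal sketch under stated assumptions rather than the authors' code: the function name, the `max_iter` safeguard, the default threshold, and the rule of keeping the old center when a cluster becomes empty are illustrative choices not taken from the paper.

```python
import math


def kmeans(data, centers, xi=1e-6, max_iter=100):
    """Plain K-means following steps a) to e); returns (centers, clusters)."""
    centers = [tuple(c) for c in centers]          # a) initial cluster centers
    previous_j = None
    clusters = [[] for _ in centers]
    for _ in range(max_iter):                      # max_iter is an added safeguard
        # b) assign each object to its closest center, Eq. (12)
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            clusters[nearest].append(x)
        # c) squared error criterion J_c, Eq. (13)
        j_c = sum(math.dist(x, centers[j]) ** 2
                  for j, members in enumerate(clusters) for x in members)
        # d) stop when J_c changes by less than xi
        if previous_j is not None and abs(j_c - previous_j) < xi:
            break
        previous_j = j_c
        # otherwise move each center to the mean of its cluster
        centers = [tuple(sum(col) / len(members) for col in zip(*members))
                   if members else centers[j]      # assumption: keep an empty cluster's center
                   for j, members in enumerate(clusters)]
    return centers, clusters                       # e) output the k clusters


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)]))
```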

The K-means algorithm has several drawbacks: the number of clusters must be given in advance, and users who lack experience in the industry find it difficult to choose the most appropriate number; the result depends heavily on the initial cluster centers; the time complexity of the algorithm is high; and for categorical attribute data the clustering results are unsatisfactory [9]. This paper therefore proposes an improved K-means algorithm.

(2) Improved K-means algorithm

The improved K-means algorithm [10] is as follows:

a) Input the data set and initialize the parameters.

b) Run the iterative process to obtain the K clustering results. Test whether the clustering result has converged; if so, calculate Sil(K) and record it. Sil is the Silhouette index:

$$Sil(t) = \frac{b(t) - e(t)}{\max\{ e(t), b(t) \}} \tag{14}$$

in which $b(t) = \min\{ d(t, C_i) \}$, $i = 1, \ldots, k$, $i \neq j$.

Test whether the cluster centers meet the convergence criterion; if so, the K clusters are obtained and Sil max is calculated. Determine the optimal number of clusters corresponding to Sil max, and calculate the Hartigan index:

$$Ha(k) = \left( \frac{trSW(k)}{trSW(k+1)} - 1 \right) (n - k - 1) \tag{15}$$

c) Initialize the K-means algorithm with the number of clusters K and the cluster centers, then run the K-means algorithm to obtain the final clustering results.
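The two indices used in step b) can be sketched as follows. Several points are assumptions of this sketch rather than statements from the paper: e(t) is taken to be the mean distance from sample t to the other members of its own cluster, d(t, C_i) the mean distance from t to the members of cluster C_i, trSW(k) the total within-cluster sum of squares for k clusters, and the overall score of a clustering the mean of Sil(t) over all samples. The function names and the helper `best_k` are hypothetical, and the paper's noise-exclusion step is not reproduced here.

```python
import math


def silhouette(clusters):
    """Mean Silhouette value, Eq. (14), for a list of two or more non-empty clusters."""
    scores = []
    for j, own in enumerate(clusters):
        for t in own:
            if len(own) == 1:
                scores.append(0.0)                 # common convention for singletons
                continue
            # e(t): mean distance to the other members of t's own cluster;
            # the distance from t to itself is 0, so dividing by len(own) - 1 excludes it.
            e_t = sum(math.dist(t, x) for x in own) / (len(own) - 1)
            # b(t): smallest mean distance d(t, C_i) over the other clusters, i != j
            b_t = min(sum(math.dist(t, x) for x in c) / len(c)
                      for i, c in enumerate(clusters) if i != j)
            denom = max(e_t, b_t)
            scores.append((b_t - e_t) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def hartigan(tr_sw_k, tr_sw_k1, n, k):
    """Hartigan index, Eq. (15), from within-cluster scatter for k and k + 1 clusters."""
    return (tr_sw_k / tr_sw_k1 - 1.0) * (n - k - 1)


def best_k(candidates):
    """Pick the K whose clustering has the largest mean Silhouette value (Sil max)."""
    return max(candidates, key=lambda k: silhouette(candidates[k]))


if __name__ == "__main__":
    two = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
    three = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0)], [(10.0, 11.0)]]
    print(best_k({2: two, 3: three}))
```

In a full pipeline, the candidate clusterings would come from running K-means once per candidate K; the selected K and its centers would then initialize the final K-means run in step c).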

Experiment and Analysis

Customer segmentation usually starts from web site visitor logs or CRM information: the data is preprocessed, a relevant model is established, and a clustering method is used to segment customers, providing a basis for enterprise decision-making. The data used in this paper comes from an e-commerce site. It includes a customer information table, a merchandise information table, a customer order table, and other tables, so the data attributes are numerous and complex.

Customer segmentation based on K-means. With K = 4, the K-means customer segmentation results are as follows:

Table 1 K-means clustering

| Clustering categories | Cluster centroids | Number of samples |
|---|---|---|
| 1 | (-1.1890, 1.6119) | 19 |
| 2 | (-0.6786, 0.1418) | 30 |
| 3 | (0.3360, -0.2445) | 22 |
| 4 | (1.2261, -1.0173) | 29 |

After data preprocessing and standardization, the algorithm is applied to the selected sample data; the resulting customer points are plotted as follows:

Fig. 4 K-means clustering

Customer segmentation based on Improved K-means. The improved K-means customer segmentation results are as follows:

Table 2 Improved K-means clustering

| Clustering categories | Cluster centroids | Number of samples | Number of noise data |
|---|---|---|---|
| 1 | (-1.1890, 1.6119) | 12 | 3 |
| 2 | (-0.6269, 0.1806) | 26 | |
| 3 | (0.1798, -0.4982) | 20 | |
| 4 | (1.2261, -1.0173) | 29 | |


After data preprocessing and standardization, the algorithm is applied to the selected sample data; the resulting customer points are plotted as follows:

Fig. 5 Improved K-means clustering

We can see that, after the data is standardized, the improved K-means clustering algorithm also excludes the three noise points and obtains better clustering results. Therefore, the improved K-means clustering algorithm performs better than the K-means clustering algorithm.

Combined with the above table, we can draw the following conclusions:

1) The 4th class of customers is the largest. Customers in this class consume infrequently, and their average amount of consumption is low. The related customer information shows that these customers have less education and lower income, and that their ages and locations are unevenly distributed;

2) The 3rd class of customers is the smallest. These customers have the lowest average number of purchases, but their average amount of consumption is very high. They are well educated, have high incomes, are mostly concentrated in Beijing, Shanghai, Guangzhou, and other cities, and are between 25 and 35 years old;

3) The 2nd class of customers consumes more frequently on average, and their amount of consumption is moderate. They are about 30 years old, have an average level of education and income, and mostly come from second- and third-tier cities;

4) The 1st class of customers consumes less frequently on average than the 2nd class, but their amount of consumption is very high. They are mostly highly educated, have high incomes, and are generally 35 to 45 years old.

Acknowledgment

Our thanks to Xi Zhou, Lei Wang, and all the members of the Research Center for Multilingual Information Technology. This work is funded by the West Light Foundation of the Chinese Academy of Sciences (No. XBBS201313) and the National High Technology Research and Development Program of China (No. 2013AA01A607).


References

[1] A. Barak, R. Gelbard, Classification by clustering decision tree-like classifier based on adjusted clusters, Expert Systems with Applications 38 (7), 2011, 8220-8228.

[2] J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web usage mining: discovery and applications of usage patterns from Web data, ACM SIGKDD Explorations Newsletter 1, 2000, 12-23. DOI: 10.1145/846183.846188

[3] Jianxi Zhang, Peiying Zhao, Lin Shang, Lunsheng Wang, Web usage mining based on fuzzy clustering in identifying target group, International Colloquium on Computing, Communication, Control, and Management, Vol. 4, 2009, pp. 209-212.

[4] R. Dubes, Cluster analysis and related issues, in: C. Chen, L. Pau, P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing Co. Inc., River Edge, NJ, 1993, pp. 3-32.

[5] D. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2, 1987, 139-172.

[6] L. Wallace, M. Keil, A. Rai, Understanding software project risk: a cluster analysis, Information and Management 42, 2004, 115-155.

[7] Innovations in Intelligent Machines, pp. 1-19. Springer, Heidelberg, 2011.

[8] Sovan Kumar Patnaik, Soumya Sahoo, Dillip Kumar Swain, Clustering of categorical data by assigning rank through statistical approach, International Journal of Computer Applications 43 (2), 2012, 1-3.

[9] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, A comparative study of various clustering algorithms in data mining, International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, 2012, pp. 1379-1384.

[10] Anoop Kumar Jain, Satyam Maheswari, Survey of recent clustering techniques in data mining, International Journal of Computer Science and Management Research, Vol. 1, Issue 1, 2013, 72-78.
