
A. Celikyilmaz and I.B. Türksen: Model. Uncertain. Fuzzy Logic, STUDFUZZ 240, pp. 51–104. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Chapter 3 Improved Fuzzy Clustering

The new fuzzy system modeling approach based on fuzzy functions implements a fuzzy clustering¹ algorithm during structure identification of the given system. This chapter introduces the foundations of fuzzy clustering algorithms and compares different types of well-known fuzzy clustering approaches. Then, a new improved fuzzy clustering approach is presented for use in fuzzy functions approaches to re-shape membership values into powerful predictors. Lastly, two new cluster validity indices² are introduced to validate the results of the improved fuzzy clustering algorithm.

Everything is vague to a degree you do not realize till you have tried to make it precise. —Bertrand Russell

3.1 Introduction

In 1965 and later in 1975, Zadeh [1965; 1975a] introduced the concept of mathematical modeling with imprecise propositions, fuzzy sets, and fuzzy logic. Since then, fuzzy sets and logic have been applied in many areas to handle uncertain information and to simulate how inferences can be made with it. Noise and uncertainty can never be totally eliminated from any given database system, and a common way to capture these types of uncertainty is to apply fuzzy logic theory to clustering problems. Fuzzy clustering algorithms can identify overlapping clusters in a given dataset while calculating membership values that specify to what degree each object belongs to each cluster.

In the next section, we first present the terminology of fuzzy clustering methods and a general classification of fuzzy cluster analysis.

¹ This section of this chapter is an extension of the papers [Celikyilmaz, Turksen, 2007a,b].
² This section of this chapter is an extension of the papers [Celikyilmaz, Turksen, 2007c and 2008c].


The “Fuzzy Functions” system modeling approach of this work (Chapter 4) utilizes the most commonly used fuzzy clustering algorithm, the Fuzzy c-Means (FCM) clustering algorithm [Bezdek, 1981a], for structure identification. Therefore, the mathematical background of FCM is presented in this chapter. Next, a novel Improved Fuzzy Clustering (IFC) algorithm is presented to replace the FCM algorithm in “Fuzzy Functions” approaches for prediction (regression) problems. The motivation and theory of the IFC algorithm are presented along with its mathematical transformation. An extension of the novel IFC, entitled Improved Fuzzy Clustering for Classification (IFC-C), is also proposed for pattern recognition problem domains. Later, the performance of the IFC algorithm in comparison to the FCM algorithm is discussed with examples. Lastly, two new cluster validity indices (CVIs) are presented for finding the optimum number of clusters with the IFC algorithm for classification and regression type system domains. Using artificial datasets, the performance of the new CVIs is discussed and the results are compared to those of other well-known CVIs.

3.2 Fuzzy Clustering Algorithms

System modeling with soft computing can be broadly separated into two main approaches: global and local system modeling [Babuška and Verbruggen, 1997]. In global modeling, the overall system is analyzed as a whole to understand the underlying relationships, if any. In local modeling, the system under study is first decomposed into meaningful parts, and sub-models are built using linear or non-linear methods. The class of fuzzy clustering algorithms is used to identify these local models.

There is a large body of ongoing research on enhancing fuzzy clustering algorithms; based on the clustering structure, these algorithms can be classified as follows:

• Fuzzy clustering based on fuzzy relations,

• Fuzzy clustering based on an objective function and a covariance matrix,

• Non-parametric classifiers, e.g., the fuzzy generalized k-nearest neighbour rule,

• Neuro-fuzzy clustering, e.g., self-organizing maps, fuzzy learning vector quantization, etc.

This work focuses on “objective-based” fuzzy clustering algorithms, which assign an error or a quantitative measure to each possible cluster partition using an evaluation function and try to minimize or maximize the total error or quality. The ideal solution is reached when the cluster partitions attain the best evaluation. Objective-based clustering algorithms thus solve an optimization problem.

Definition 3.1. (Objective Function) The objective function J(f), or J, is an error or quantitative measure; the aim in fuzzy clustering algorithms is to find the global minimum or maximum of J, depending on the structure of the clustering algorithm. J is used to compare different solutions, usually for the same clustering problem.


Let X={x1,x2,…,xn} represent a set of n objects, where each object k, k=1,…,n, is represented by an nv-dimensional vector, xk=[x1,k,…,xnv,k]T ∈ ℜnv. The set of n vectors is then represented by the n×nv data matrix

$$X=\begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,nv}\\ \vdots & & \ddots & \vdots\\ x_{n,1} & x_{n,2} & \cdots & x_{n,nv} \end{bmatrix} \qquad (3.1)$$

A fuzzy clustering algorithm partitions the given dataset X into c overlapping clusters, forming a fuzzy partition matrix, U.

Definition 3.2. (Fuzzy Partition Matrix) The fuzzy partition matrix, U, is the matrix of the degrees of membership of every object xk, k=1,…,n, in every cluster i, i=1,…,c. The degree of membership of the kth vector in cluster i is represented by μik∈U. The partition matrix has the form:

$$U=\begin{bmatrix} \mu_{1,1} & \mu_{2,1} & \cdots & \mu_{c,1}\\ \vdots & & \ddots & \vdots\\ \mu_{1,n} & \mu_{2,n} & \cdots & \mu_{c,n} \end{bmatrix} \qquad (3.2)$$

In a fuzzy clustering algorithm, each cluster is represented by a vector called the “cluster center” or “cluster prototype”, with which the cluster structures in the overall X can be represented.

Definition 3.3. (Vector of Cluster Centers/Prototypes) In a dataset X of nv-dimensional vectors, a fuzzy clustering algorithm identifies c cluster center vectors, V={υ1,υ2,…,υc}∈ℜc×nv, where each cluster center υi∈ℜnv is also an nv-dimensional vector. Each cluster center (υi) is usually represented as a centroid, e.g., the average of all the data of the corresponding cluster.

Among the many different types of fuzzy clustering algorithms, this work deals with objective function based, point-wise (distance-based) clustering algorithms. Since the system modeling approach with “Fuzzy Functions”, to be discussed in the following chapters, executes the fuzzy c-means (FCM) clustering algorithm for structure identification, a detailed explanation of this algorithm is presented next.

3.2.1 Fuzzy C-Means Clustering Algorithm

The Fuzzy c-Means (FCM) clustering algorithm [Bezdek, 1981a] is a simple and yet the most commonly used and extended of all fuzzy clustering approaches. The FCM algorithm assumes that the number of clusters, c, is known or at least fixed, i.e., it partitions the given dataset X = {x1,...,xn} into c clusters. Since the assumption of a known or previously fixed number of clusters is not realistic for many data analysis problems, there are techniques such as cluster validity index (CVI) analysis to determine the number of clusters for the FCM algorithm. Some well-known CVIs are discussed, and two new cluster validity criteria are introduced, in the later sub-sections of this chapter.

Let each of the c clusters be represented by a cluster prototype, υi. The FCM algorithm tries to minimize an objective function given two pieces of prior information, the number of clusters, c, and the fuzziness parameter, m, as follows:

$$\min J(X;U,V)=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}\right)^{m} d^{2}(x_k,\upsilon_i) \qquad (3.3)$$

In (3.3), m∈(1,∞) represents the “degree of fuzziness”³ or “fuzzifier” of the fuzzy clustering algorithm and determines the degree of overlap of the clusters; m=1 would mean no overlap, i.e., a crisp clustering structure. Here, d2(xk,υi) is a measure of the distance between the kth object and the ith cluster center. Using squared distances ensures that the objective function is non-negative, J≥0. The objective function reaches 0 only in the trivial case where every data object is itself a cluster center, c=n. On the other hand, the farther the data objects lie from the cluster centers, υi, the larger the objective function becomes. Both the location and the number of cluster centers affect its value. The criterion is minimal at the optimum solution, so one should search for the global minimum. In order to avoid trivial solutions, two constraints are imposed on the partition matrix U, as follows:

$$\sum_{i=1}^{c}\mu_{ik}=1,\quad \forall k \qquad (3.4)$$

$$0<\sum_{k=1}^{n}\mu_{ik}<n,\quad \forall i \qquad (3.5)$$

The constraint in (3.4) implies that each row of the partition matrix in (3.2) adds up to 1⁴. The constraint in (3.5) implies that the column total of the membership values can neither exceed the number of data vectors, n, nor be zero; this indicates that at least one member is assigned to each cluster. However, neither constraint forces the membership values of each cluster to follow a particular distribution. The general form of the distance measure is:

$$d^{2}(x_k,\upsilon_i)=(x_k-\upsilon_i)^{T}A_i\,(x_k-\upsilon_i)\ge 0 \qquad (3.6)$$

³ Fuzziness is a type of uncertainty (of imprecision) accepted in uncertainty theory [Zadeh, 1965, 1975a]. Various functions have been proposed to measure the degree of fuzziness. In fuzzy clustering algorithms, the overlap constant, m, is used as the degree of fuzziness. In later chapters, m will be used as a parameter to define uncertainty in the proposed fuzzy functions approach, along with other measures.

⁴ In some research, such as Krishnapuram and Keller (1993), the constraint (3.4) is relaxed in the possibilistic approach to clustering.


In (3.6) the norm matrix Ai, i=1,…,c, is a positive definite symmetric matrix. Other distance measures can also be used in fuzzy clustering algorithms; a short list is given in Table 3.1. The FCM algorithm uses the Euclidean distance, so the norm matrix Ai equals the identity matrix (A=I), the input matrix having been scaled to standard deviation 1 and mean 0. On the other hand, Gustafson and Kessel [1979] use the Mahalanobis distance, in which case the norm matrix of each cluster equals the inverse of that cluster's covariance matrix, i.e., Ai=Ci⁻¹.

Table 3.1 Distance Measures

Euclidean distance: $d_2(a,b)=\left[\sum_{i=1}^{nv}(a_i-b_i)^{2}\right]^{1/2}$

Minkowski distance: $d_p(a,b)=\left[\sum_{i}|a_i-b_i|^{p}\right]^{1/p},\; p>0$

Maximum distance: $d_\infty(a,b)=\max_{1\le i\le nv}|a_i-b_i|$

Mahalanobis distance: $d_A(a,b)=(a-b)^{T}A\,(a-b)$
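For concreteness, the measures of Table 3.1 can be written out as follows (an illustrative Python/NumPy sketch, not part of the original text; function names are ours):

```python
import numpy as np

def euclidean(a, b):
    # d_2(a, b) = [sum_i (a_i - b_i)^2]^(1/2)
    return np.sqrt(np.sum((a - b) ** 2))

def minkowski(a, b, p):
    # d_p(a, b) = [sum_i |a_i - b_i|^p]^(1/p), p > 0
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def maximum(a, b):
    # d_inf(a, b) = max_i |a_i - b_i|
    return np.max(np.abs(a - b))

def mahalanobis_sq(a, b, A):
    # d_A(a, b) = (a - b)^T A (a - b); A = I gives the squared Euclidean
    # distance used by FCM, A = C^{-1} the Gustafson-Kessel choice
    diff = np.asarray(a) - np.asarray(b)
    return float(diff @ A @ diff)
```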

From (3.3)-(3.6), one can see that the FCM algorithm is a constrained optimization problem whose objective should be minimized to obtain optimum results. Therefore, the FCM algorithm can be written as a single optimization model as follows:

$$\begin{aligned} \min\; & J(X;U,V)=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}\right)^{m} d_{A}^{2}(x_k,\upsilon_i)\\ \text{s.t.}\; & 0\le\mu_{ik}\le 1,\;\forall i,k\\ & \sum_{i=1}^{c}\mu_{ik}=1,\;\forall k\\ & 0<\sum_{k=1}^{n}\mu_{ik}<n,\;\forall i \end{aligned} \qquad (3.7)$$

The constrained optimization model in (3.7) [Bezdek, 1981a] can be solved using a well-known method in mathematics, the Lagrange multiplier method [Khuri, 2003], which converts the model into an unconstrained optimization problem with one objective function. The primal constrained optimization problem is first converted into an equivalent unconstrained problem with the help of unspecified parameters known as Lagrange multipliers, λ:


$$\max W(U,V)=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}\right)^{m} d_{A}^{2}(x_k,\upsilon_i)-\lambda\left(\sum_{i=1}^{c}\mu_{ik}-1\right) \qquad (3.8)$$

According to the Lagrangian method, the Lagrangian function must be minimized with respect to the primal parameters and maximized with respect to the dual parameters, and its derivatives with respect to the original model parameters, U and V, should vanish. Hence, by taking the derivatives of the objective function in (3.8) with respect to the cluster centers, V, and the membership values, U, the optimum membership value calculation equation and cluster center equation are formulated as:

$$\mu_{ik}^{(t)}=\left[\sum_{j=1}^{c}\left(\frac{d\left(x_k,\upsilon_i^{(t-1)}\right)}{d\left(x_k,\upsilon_j^{(t-1)}\right)}\right)^{2/(m-1)}\right]^{-1} \qquad (3.9)$$

$$\upsilon_i^{(t)}=\left(\sum_{k=1}^{n}\left(\mu_{ik}^{(t)}\right)^{m}x_k\right)\Bigg/\left(\sum_{k=1}^{n}\left(\mu_{ik}^{(t)}\right)^{m}\right),\quad \forall i=1,\dots,c \qquad (3.10)$$

In (3.9), υi(t-1) represents the cluster center vector of cluster i obtained in the (t-1)th iteration. Similarly, in (3.9) and (3.10), μik(t) denotes the optimum membership value calculated at the tth iteration. The derivations of the membership value calculation formula in (3.9) and the cluster center function in (3.10) can be found in Appendix B.1. These equations show that the membership values and the cluster centers depend on each other, so Bezdek [1981a] proposed an iterative algorithm to calculate them. The objective function J(t) at each iteration t is measured by

$$J^{(t)}=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{(t)}\right)^{m} d^{2}\left(x_k,\upsilon_i^{(t)}\right)>0 \qquad (3.11)$$

The FCM algorithm stops according to a termination criterion, e.g., after a certain number of iterations, or when the magnitude of the change between two consecutive solutions is less than a pre-determined value (ε), etc. The iterative FCM algorithm is shown in ALGORITHM 3.1.

ALGORITHM 3.1 Fuzzy C-Means Clustering Algorithm (FCM)

Given the data vectors, X={x1,…,xn}, the number of clusters, c, the degree of fuzziness, m, and a termination constant, ε (or a maximum iteration number), initialize the partition matrix, U, randomly.

Step 1: Find the initial cluster centers using (3.10), with the membership values of the initial partition matrix as inputs.

Step 2: For iteration t=1,…,max-iteration:

Step 2.1: Calculate the membership value μik(t) of each input data object k in each cluster i using the membership value calculation equation (3.9), where the xk are the input data vectors and the υi(t-1) are the cluster centers from the (t-1)th iteration.

Step 2.2: Calculate the cluster center υi(t) of each cluster i at iteration t using the cluster center function (3.10), where the inputs are the data vectors, xk, and the membership values of iteration t, μik(t).

Step 2.3: Stop if the termination condition is satisfied, e.g., |υi(t)−υi(t-1)|≤ε; otherwise continue with the next iteration.

The effect of the fuzziness value, m, can be analyzed by taking the limit of the membership value calculation equation in (3.9) at the boundaries. As m→∞,

$$\lim_{m\to\infty}\mu_{ik}=\lim_{m\to\infty}\left[\sum_{j=1}^{c}\left(\frac{d^{2}(x_k,\upsilon_i)}{d^{2}(x_k,\upsilon_j)}\right)^{1/(m-1)}\right]^{-1}=\frac{1}{c},\quad \forall i,j=1,\dots,c, \qquad (3.12)$$

and, under the assumption that no two cluster centers are alike, as m→1 we get

$$\lim_{m\to 1}\mu_{ik}=\begin{cases}1 & \text{if } d^{2}(x_k,\upsilon_i)<d^{2}(x_k,\upsilon_j),\;\forall j=1,\dots,c,\; j\ne i\\ 0 & \text{otherwise}\end{cases} \qquad (3.13)$$

As the value of m increases, the μik converge to 1/c; e.g., when m=6, even a strong membership such as μik=0.85 is pulled down close to 1/c (see (3.12)). Since the parameter m represents the degree of overlap between clusters, the larger m gets, the fuzzier the results and the wider the overlap. As m gets smaller, the fuzzy clustering result approaches a crisp clustering model; m=1 corresponds to crisp clustering, where there is no overlap between clusters and all membership values are μik ∈ {0,1}.
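The FCM iteration of ALGORITHM 3.1 can be sketched as follows (a minimal Python/NumPy illustration of the update equations (3.9) and (3.10); the initialization details and numerical safeguards are simplifications of ours):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100):
    """Fuzzy c-means (ALGORITHM 3.1). X: (n, nv) data matrix,
    c: number of clusters, m > 1: degree of fuzziness."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                     # constraint (3.4): columns sum to 1
    for _ in range(max_iter):
        # cluster centers, eq. (3.10): membership-weighted means
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # squared Euclidean distances d^2(x_k, v_i), shape (c, n)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)            # guard against division by zero
        # membership update, eq. (3.9)
        r = d2 ** (1.0 / (m - 1))
        U_new = (1.0 / r) / (1.0 / r).sum(axis=0)
        if np.abs(U_new - U).max() <= eps: # termination, Step 2.3
            U = U_new
            break
        U = U_new
    return U, V
```

Lowering m toward 1 in this sketch drives the memberships toward crisp {0,1} assignments, while raising it pushes every μik toward 1/c, in line with the limits (3.12) and (3.13).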

Earlier research [Turksen, 1999] indicates that, as a rule of thumb, m=2 should be used in system modeling analysis. In a more recent study [Ozkan and Turksen, 2007], the maximum and minimum values of m were mathematically derived to be m∈[1.4,2.6], based on a Taylor expansion analysis of the membership value calculation function. In this work, the fuzziness interval, m∈[m-lower, m-upper], will also be investigated based on a new fuzzy modeling approach in the uncertainty modeling chapter, Chapter 5, entitled “Modeling Uncertainty with Improved Fuzzy Functions”.

The FCM clustering algorithm [Bezdek, 1981a] can be regarded as the standard fuzzy clustering algorithm, and many extensions of it have been proposed for various purposes. FCM clustering is also the core structure of the improved fuzzy clustering (IFC) algorithm proposed in this work, to be explained in the following sub-sections. In the next sub-section, different classes of fuzzy clustering algorithms are reviewed.


3.2.2 Classification of Objective Based Fuzzy Clustering Algorithms

Objective function based fuzzy clustering algorithms other than the FCM algorithm can be categorized into three groups based on their purpose in system modeling:

1. Algorithms that use adaptive distance measures [Gustafson and Kessel, 1979] or fuzzy maximum likelihood estimation methods [Gath and Geva, 1989],

2. Algorithms based on fuzzy linear varieties and fuzzy c-elliptotypes [Bezdek et al., 1981b; Bezdek et al., 1981c],

3. Fuzzy c-regression algorithms using prototypes defined by regression functions [Hathaway and Bezdek, 1993].

Gustafson and Kessel [1979] proposed using a different symmetric, positive semi-definite norm matrix, A, for each cluster. Gath and Geva's method [1989] is an extension of the Gustafson and Kessel [1979] algorithm that also uses the size and density of the point-wise clusters. The fuzzy c-varieties algorithm of Bezdek [1981a; 1981b; 1981c], on the other hand, was developed for recognizing lines, planes, and hyper-planes. Shell fuzzy clustering algorithms are more recently developed methods for recognizing circular, elliptical, and parabolic shapes [Höppner et al., 1999]. The fuzzy c-regression model clustering algorithm, discussed next, falls into the third category. Since the scope of this work covers only the third category of fuzzy clustering algorithms, we leave out the details of the first two.

3.2.3 Fuzzy C-Regression Model (FCRM) Clustering Algorithm

The objective of the fuzzy c-regression model (FCRM) clustering algorithm [Hathaway and Bezdek, 1993], as in all clustering algorithms, is to group similar objects. The FCRM algorithm yields simultaneous estimates of the parameters of c regression models while fuzzily partitioning a given dataset. A prominent feature separating it from point-wise clustering algorithms such as FCM is that its cluster prototypes are functions instead of geometrical objects. FCRM can be used to separate linear patterns, where each pattern is identified by a linear function. Hathaway and Bezdek [1993] illustrated the domain of the FCRM algorithm with a simple example, shown in Figure 3.1, in which the artificial dataset is generated by two regression models, e.g., linear functions. In [Hathaway and Bezdek, 1993] these models are called switching regression models, and the FCRM algorithm tries to identify them using fuzzy sets. Although FCRM is based on the standard FCM clustering algorithm [Bezdek, 1981a], there are several differences between them, briefly listed as follows:


• In the FCM clustering algorithm [Bezdek, 1981a], clusters are hyper-sphere shaped, whereas in the FCRM clustering algorithm [Hathaway and Bezdek, 1993], clusters are hyper-plane shaped.

• The representatives of clusters in FCM are cluster centers, υi, whereas the representatives of clusters in FCRM are hyper-planes, represented with nv-dimensional inputs and an output; e.g., for a multi-input, single-output model,

$$y_i=\beta_i^{0}+\beta_i^{1}x_1+\dots+\beta_i^{nv}x_{nv}, \qquad (3.14)$$

where the βi are the regression coefficients of each function i, i=1,…,c.

• The FCM algorithm calculates cluster centers by averaging the data vectors weighted by their membership values; FCRM calculates the cluster representative functions by a weighted least-squares regression algorithm.

To demonstrate the FCRM clustering algorithm, the artificial dataset in Figure 3.1 is composed using two different linear functions with random noise, ε, of mean 0 and standard deviation 0.05, using MATLAB's rand(x) function with x∈[0,1], i.e., f1(x)=rand(x)+ε and f2(x)=ε+0.3.

[Figure] Fig. 3.1 Scatter plot of artificial data from two different random functions: f1(x)=y+random_noise, f2(x)=0.3+random_noise (x on the horizontal axis, y on the vertical axis).
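The switching-regression dataset of Figure 3.1 can be reproduced along the following lines (a sketch; NumPy stands in for MATLAB's rand, and we read f1 as the identity line plus noise, matching the figure caption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.random(n)                  # x in [0, 1], as in the text
noise = rng.normal(0.0, 0.05, n)   # mean 0, standard deviation 0.05
y1 = x + noise                     # f1: the sloped line plus noise
y2 = 0.3 + noise                   # f2(x) = 0.3 + noise (the flat line)
# stack both models into one unlabeled switching-regression dataset
X = np.concatenate([x, x])
Y = np.concatenate([y1, y2])
```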

The FCRM algorithm yields simultaneous estimates of the parameters of the c regression models together with a fuzzy c-partitioning of the data [Hathaway and Bezdek, 1993]. Each regression model is defined as:

yk = fi (xk , βi) (3.15)


In (3.15), xk=[x1,k,…,xnv,k]T∈ℜnv denotes the kth data object, the βi∈ℜnv, i=1,…,c, are the parameters of the functions fi, and c is the total number of functions. The performance of the captured functions is generally measured by

$$E_{ik}(\beta_i)=\big(y_k-f_i(x_k,\beta_i)\big)^{2} \qquad (3.16)$$

The objective function to minimize the total error of these approximated functions is calculated by

$$E(U,\beta)=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}\right)^{m}E_{ik}(\beta_i) \qquad (3.17)$$

where m∈(1,∞) is the “fuzzifier” exponent, which determines the overlap of the functions in a slightly different way than in the FCM algorithm [Bezdek, 1981a]. In the FCRM algorithm, as m approaches ∞, each function will have almost the same parameters, i.e., coefficients (weights) βi; graphically speaking, all the functions will lie on top of each other. On the other hand, as m approaches 1, the functions will be as distinct from each other as possible. Therefore, the choice of the parameter m affects the performance of FCRM clustering models as well.

In the FCRM modeling algorithm, the membership values, μik, are interpreted as weights of linear or polynomial regression functions. They represent to what extent the values predicted by the model, fi(xk,βi), are close to yk. Therefore, based on the membership value calculation equation of FCM in (3.9), the membership value calculation equation of the FCRM algorithm is re-formulated in [Hathaway and Bezdek, 1993] as follows:

$$\mu_{ik}=\left[\sum_{j=1}^{c}\left(\frac{E_{ik}}{E_{jk}}\right)^{1/(m-1)}\right]^{-1},\quad \forall i,j=1,\dots,c \qquad (3.18)$$
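In code, (3.18) maps the matrix of squared model errors directly to memberships (an illustrative sketch; names ours):

```python
import numpy as np

def fcrm_memberships(E, m=2.0):
    """Eq. (3.18): memberships from the (c, n) matrix E of squared
    errors E_ik of the c regression models on the n samples."""
    r = np.fmax(E, 1e-12) ** (1.0 / (m - 1))
    return (1.0 / r) / (1.0 / r).sum(axis=0)
```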

Minimization of the objective function in (3.17) yields an optimization model as shown in ALGORITHM 3.2.

ALGORITHM 3.2 Fuzzy C-Regression Models Clustering Algorithm (FCRM)

Given the data vectors, X={x1,…,xn}, choose the number of clusters, c, the degree of fuzziness, m, a termination constant, ε, a maximum iteration number, and the structure of the regression models in (3.15). Initialize the partition matrix, U, randomly.

For iteration t=1,…,max-iteration:

Step 1: Calculate the values of the model parameters βi that minimize the cost function in (3.17).

Step 2: Update the partition matrix, μik∈U, using (3.18). Terminate if |U(t)−U(t-1)|≤ε; otherwise go to Step 1.

Hathaway and Bezdek [1993] search for the optimum parameters of the functions using the weighted least-squares method, with the membership degrees of the fuzzy partition matrix U as weights. In this setting, the input matrix, X, the output vector, y, and the membership degrees, Ui, are represented by

$$X=\begin{bmatrix}x_1^{T}\\ x_2^{T}\\ \vdots\\ x_n^{T}\end{bmatrix},\qquad y=\begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix},\qquad U_i=\begin{bmatrix}\mu_{i,1} & 0 & \cdots & 0\\ 0 & \mu_{i,2} & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & \mu_{i,n}\end{bmatrix}$$

The parameters, βi, of each function, fi, are then calculated with the weighted least-squares algorithm:

$$\beta_i=\left[X^{T}U_iX\right]^{-1}X^{T}U_i\,y \qquad (3.19)$$
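Equation (3.19) is an ordinary weighted least-squares solve; one FCRM parameter update might be sketched as follows (names ours; the weights are taken as the memberships raised to the fuzzifier m, which is what minimizes the cluster term of the objective (3.17)):

```python
import numpy as np

def fcrm_beta(X, y, u_i, m=2.0):
    """Weighted LSE for one cluster, eq. (3.19):
    beta_i = [X^T U_i X]^{-1} X^T U_i y with U_i = diag(u_i^m).
    X: (n, nv) design matrix (add a column of ones for an intercept),
    y: (n,) outputs, u_i: memberships of the n samples in cluster i."""
    w = u_i ** m        # fuzzified weights, matching objective (3.17)
    XtU = X.T * w       # X^T U_i without forming the diagonal matrix
    return np.linalg.solve(XtU @ X, XtU @ y)
```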

In the FCRM clustering method, linear functions are generally formulated to find hidden structures in a given dataset; possible extensions of FCRM implement non-linear functions to find hidden patterns. The new improved fuzzy clustering algorithm of this work, which utilizes FCRM-type clustering, introduces additional non-linearity by using non-linear algorithms to find the functions in a novel way. For this purpose, the FCRM algorithm is one of the fundamental structures of the new clustering algorithm, along with the standard FCM clustering algorithm. Next, we review different clustering algorithms that combine regression and clustering methods in one structure.

3.2.4 Variations of Combined Fuzzy Clustering Algorithms

In 1993, Sugeno and Yasukawa published a paper on a fuzzy-logic-based approach to qualitative modeling in which they proposed achieving structure identification by combining the FCM clustering algorithm [Bezdek, 1981a] with the group method of data handling. In their approach, the number of rules is captured by minimizing the variance within each output cluster and maximizing the variance between clusters. The main advantage of their method is that structure identification is separated from parameter identification. These clusters constitute the building blocks of fuzzy rule based approaches. Despite the popularity of the FCM algorithm, other variations of clustering algorithms are also used in the literature to build fuzzy system models for different purposes. One prominent FCM variation, the fuzzy c-regression methods (FCRM) presented in the previous section, is commonly used to fuzzily partition a given dataset for local function approximation. In one of these studies, Chen et al. [1998] present a new fuzzy clustering algorithm combining fuzzy c-functions clustering with a fuzzy c-means-like clustering algorithm; their fuzzy c-functions clustering is based on linear fuzzy regression.

In [Chen et al., 1998], Takagi-Sugeno-Kang (TSK) type system modeling structures [Takagi and Sugeno, 1985; Sugeno and Kang, 1986] are utilized. A non-linear optimization algorithm is applied to identify the parameters of the premises and consequents of each rule; the consequent and premise parameters are optimized in turns, one fixed while the other is optimized. In sequence, they develop different criteria for optimizing the premise and consequent model parameters based on the fuzzy c-partition space. Even though their algorithm cannot be considered one unified clustering algorithm, they still manage to combine fuzzy c-varieties and c-means clustering in one system model and use them interchangeably to find the premise and consequent parameters of the fuzzy rules, though again not within one optimization algorithm.

On the other hand, Höppner and Klawonn [2003] combine the FCM [Bezdek, 1981a] and FCRM [Hathaway and Bezdek, 1993] algorithms in one clustering schema to build a combined clustering structure. Their main goal was to modify the objective function of the fuzzy clustering algorithm so as to prevent the effect of harmonics, i.e., to eliminate counterintuitive membership values: the membership value of a linguistic term “young”, which is high for “17 years”, should not be higher for “23 years” than for “21 years”. They deal not only with point-wise clustering algorithms such as FCM [Bezdek, 1981a] but also with fuzzy c-regression model (FCRM) clustering algorithms [Hathaway and Bezdek, 1993]. They modified the objective function of the FCM algorithm to yield the following membership value calculation equation, based on their heuristic approach:

$$\mu_{ik}=\left[\sum_{j=1}^{c}\frac{d_{ik}^{2}-\eta\,\min_{i=1\dots c} d_{ik}^{2}}{d_{jk}^{2}-\eta\,\min_{i=1\dots c} d_{ik}^{2}}\right]^{-1},\qquad 0<\eta \qquad (3.20)$$

where η>0 is a user defined constant. The objective in [Höppner and Klawonn, 2003] was to find membership values that eliminate harmonics by pushing larger membership values towards “1” and smaller ones towards “0” based on the distances of the objects. In [Höppner and Klawonn, 2003], each function, $\hat{y}_i=\hat{\beta}_i^{T}\tilde{x}_i$, is interpreted as a rule in a T-S model. Combining the FCM and FCRM algorithms in one clustering schema (since both are objective based), Höppner and Klawonn [2003] introduced a new combined distance function:

$$d_{ik}^{2}\big((x_k,y_k),(\upsilon_i,\hat{\beta}_i)\big)=\underbrace{(x_k-\upsilon_i)^{2}}_{\text{FCM distance}}+\underbrace{\big(\hat{\beta}_i^{T}\tilde{x}_k-y_k\big)^{2}}_{\text{FCRM distance}} \qquad (3.21)$$

In (3.21), (xk, yk) is a given input-output data sample, xk∈X, yk∈Y, k=1,…,n, n is the total number of training data vectors, d2 is the distance function, and υi is the prototype of cluster i, i=1,…,c, as in the FCM cluster center function, with c the number of clusters. $\tilde{x}$ represents a user defined polynomial; for instance, a two-dimensional polynomial can be formed with the following vector,

$$\tilde{x}(x_1,x_2)=\left(1,\;x_1,\;x_2,\;x_1x_2,\;x_1^{2},\;x_2^{2}\right) \qquad (3.22)$$


and the coefficients of the polynomial are represented with $\hat{\beta}_i$'s for each cluster i. Hence, the first term of the distance function in (3.21) is the FCM distance measure, and the second term is the FCRM distance measure, which equals the error of the estimated functions in (3.16). Based on the distance function in (3.21), they optimize the partition matrix in their new fuzzy clustering algorithm, which uses the membership value calculation equation in (3.20). The coefficients $\hat{\beta}_i$ are obtained in the same way as the cluster centers of FCM clustering. Therefore, the prototype update function of FCM clustering in (3.10) is replaced with

$$\hat{\beta}_i=\left(\sum_{k=1}^{n}\left(\mu_{ik}\right)^{m}\tilde{x}_k\tilde{x}_k^{T}\right)^{-1}\left(\sum_{k=1}^{n}\left(\mu_{ik}\right)^{m}y_k\,\tilde{x}_k\right),\quad \forall i=1,\dots,c \qquad (3.23)$$

Their combined clustering algorithm is sketched in ALGORITHM 3.3.

ALGORITHM 3.3 Fuzzy Model Algorithm of Höppner and Klawonn (2003)

Step 1: Choose the number of clusters, c, a termination threshold, ε>0, and η>0.

Step 2: Initialize the cluster prototypes, υi.

Step 3: Update the membership matrix using (3.20) and the distances in (3.21).

Step 4: Update the prototypes using FCM's prototype function in (3.10) and the coefficients using (3.23).

Step 5: Iterate until |U(t)−U(t-1)|≤ε.
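The combined distance (3.21), under the polynomial expansion (3.22), can be sketched for one input-output pair as follows (illustrative names; the two-variable case is assumed):

```python
import numpy as np

def poly2(x1, x2):
    # user defined polynomial vector, eq. (3.22)
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def combined_distance_sq(xk, yk, v_i, beta_i):
    """Eq. (3.21): FCM term ||x_k - v_i||^2 plus FCRM term
    (beta_i^T x~_k - y_k)^2 for one input-output pair (xk, yk)."""
    fcm_term = np.sum((xk - v_i) ** 2)
    fcrm_term = (beta_i @ poly2(*xk) - yk) ** 2
    return fcm_term + fcrm_term
```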

It should be noted that the membership value calculation equation in (3.20) is not the result of a mathematical transformation obtained by combining the two clustering algorithms, FCM and FCRM. Moreover, the authors do not specify how the polynomials used to approximate local non-linear functions are chosen.

Other variations of combined fuzzy clustering methods are also found in the literature. For instance, in [Wang et al., 2002], the FCRM algorithm is replaced with a gravity-based clustering algorithm based on Newton's law of gravity. Their aim was to capture shell/curve-like structures in a given dataset, which is suitable for image analysis; they combine gravity-based clustering and fuzzy clustering into one integrated clustering algorithm.

Leski's [2004] combined approach to fuzzy clustering is based on local non-linear function estimation of local fuzzy models using the FCRM [Hathaway and Bezdek, 1993] algorithm. Leski [2004] introduced an ε-insensitive fuzzy c-regression algorithm by converting the optimization of the FCRM clustering algorithm into an optimization similar to support vector regression in order to handle outliers. Using the fuzzy partition space, Leski [2004] formed TSK type fuzzy system models. One drawback of this approach is that it is computationally complex compared with the other fuzzy clustering variations mentioned earlier.

A major disadvantage of most of the combined fuzzy clustering approaches reviewed above is that the algorithms are only defined for training datasets, where the output variable is known. The learning algorithms in these earlier approaches utilize the output values of the training samples during training; similarly, the distance measures and membership value calculation equations use output values during clustering. In most investigations, it would be preferable if the membership value calculation equations of these fuzzy clustering methods could be used to find the membership values of validation or testing data objects, whose output values are not known. Almost none of these approaches explains explicitly how to calculate the membership values of verification or testing samples with the proposed membership value calculation equations, nor do they define a separate equation for this purpose. In this work, in Chapter 4, novel training and testing algorithms are explicitly defined to handle such problems, and a probabilistic case-based inference is proposed for reasoning.

Before moving on to the details of the proposed fuzzy clustering approach, it should be pointed out that fuzzy clustering algorithms and their variations are designed for different purposes in system modeling. For instance, some fuzzy clustering methods are designed to capture noise in a given dataset, e.g., [Dave, 1991]; some try to eliminate harmonics [Kilic, 2002]; others search for relationships through linear or non-linear functions, e.g., [Leski, 2004]. In this sense, the new improved fuzzy clustering algorithm has a novel goal. The FCM and FCRM algorithms and their variations are the basis of the proposed improved fuzzy clustering algorithm of this work. The motivation and underlying background of the new fuzzy clustering algorithm are presented in the next section.

3.3 Improved Fuzzy Clustering Algorithm (IFC)

3.3.1 Motivation

Since multi-input, single-output (MISO) models are the focus of this work, a clustering algorithm is designed to approximate local models that explain the relationship between inputs and outputs. Most earlier clustering algorithms are designed to find local groupings of data vectors while leaving out any linear or non-linear relationships between them. However, data objects could be grouped by considering not only the local groupings of the input-output data (clusters) but also the relationships between them. Here a new clustering algorithm is proposed to find local fuzzy partitions for estimating local fuzzy models with fuzzy functions [Celikyilmaz and Turksen, 2007b], based on combined clustering structures such as (3.21).

The “Fuzzy Functions” approach was initially proposed by Turksen in 2005 [Turksen, 2008]; the first implementation was published in [Celikyilmaz, 2005]. Later, Turksen and Celikyilmaz [Turksen and Celikyilmaz, 2006] published the idea as an alternative reasoning method to Fuzzy Rule Base structures. In simpler terms, instead of using fuzzy rule bases, each rule is converted into a “Fuzzy Function”, ŷi=fi(Φ(μi,x)), i=1,…,c, representing it with a linear or a non-linear function.


The feature space Φ(⋅) is similar to a mapping of the original input space, x, onto a user defined higher-dimensional space. The dimension of the feature space is determined by the user, and it is formed by the original input variables, the membership values, μi, and/or their mathematical transformations, for instance power transformations of the membership values. In a sense, the membership values are used as additional dimensions in approximating hyper-surfaces for each cluster identified by any type of fuzzy clustering algorithm. In the following chapters, we analyze fuzzy system modeling with the “Fuzzy Functions” approach in more detail; here, only a brief introduction is given in order to explain the motivation for the proposed improved fuzzy clustering algorithm. Before introducing the details of the new clustering algorithm, its motivation is discussed in connection with earlier improved and combined fuzzy clustering algorithms such as the combined clustering methods of [Höppner and Klawonn, 2003].

The foundations of the Improved Fuzzy Clustering (IFC) approach of this work were introduced by Höppner and Klawonn [2000, 2003], Chen et al. [1998], and Menard [2001]. These earlier fuzzy clustering algorithms and the IFC algorithm are similar in the sense that they share a similar objective function, combining standard fuzzy clustering and fuzzy c-regression methods. However, there are many structural differences between the IFC of this work and the earlier FCM variations, viz. FCRM or the improved clustering algorithms described in [Höppner & Klawonn, 2003; Chen et al., 1998; Menard, 2001]. These differences can be listed as follows (the earlier clustering types will be denoted “earlier variations”):

• Earlier variations use polynomials to estimate the parameters of the estimated functions. The novel IFC can use any type of function, including simple functions such as linear regression or non-linear functions such as support vector regression [Gunn, 1998].

• Earlier versions and the IFC represent membership values differently during structure and parameter identification. Earlier FCRM algorithms use membership values as weights in weighted regression, whereas the IFC method uses membership values as new predictors of the functions of each sub-structure (pattern) in the data, in the sense of [Turksen, 2008]. This is the unique property of the “Fuzzy Function” methodologies proposed by Turksen in 2005 [2008].

• An earlier combined fuzzy clustering algorithm [Höppner and Klawonn, 2003], as reviewed above, uses a slightly different cluster prototype function, as in (3.23), to update the estimated parameters of the calculated functions, whereas the new IFC updates the coefficients of the Fuzzy Functions using any type of function estimation method, e.g., multiple linear regression, support vector regression, or even neural networks.

• Membership values calculated with the earlier membership value calculation equations are high, i.e., close to 1, when the corresponding data objects can explain the local input-output relationships, and low when the corresponding data vectors are outliers of that particular local structure. In IFC, in contrast, the membership values and their transformations are treated as additional candidate input variables, not just as indicators of data points; the structure of the IFC algorithm forces the membership values to be better predictors of the local models.

• Earlier combined fuzzy clustering versions build Takagi and Sugeno [1985] based fuzzy system models, whereas IFC is proposed to be used in novel Fuzzy Functions systems [Celikyilmaz and Turksen, 2007b-k], to be presented in Chapter 4.

• Earlier fuzzy clustering algorithms require the observed output variable, yk, to estimate the error of each function of each identified local structure (cluster); see (3.16). Given some data points, the algorithm identifies in which part of the input space they fall and uses the local model fi(x) assigned to it to predict the output. The error between the observed output and the output estimated by each local model is an input to the membership value calculation equation in (3.18). In short, during training, one needs to know the output value of a particular data point before its membership values can be estimated. This can be problematic for real-life datasets, where the output values are generally not known. The improved fuzzy clustering algorithm (IFC) also uses a similar structure, in which the membership value calculation equation requires the output in order to calculate the membership value of each data vector in each structure. But in this work a new inference engine is proposed that can calculate an approximate output value using case-based methods, in order to calculate the membership values of vectors with unknown output values. Earlier system modeling approaches that utilize the FCM variations stated above [Hathaway and Bezdek, 1993; Leski, 2004; Höppner and Klawonn, 2002] do not explain how the membership values of testing data vectors are calculated. In this sense, IFC clustering is a unique approach and can be applied to any type of data structure, even when there are missing output values in the dataset.

In this work, extending the distance function in (3.21), a new Improved Fuzzy Clustering (IFC) is proposed. It should be noted that the new IFC is implemented in Fuzzy Functions systems [Celikyilmaz and Turksen, 2007b], where membership values and their user defined transformations are used as additional predictors in approximating the local input-output relationships. Earlier similar research [Höppner and Klawonn, 2003; Chen et al., 1998] uses FRB structures, e.g., T-S models, where the membership values are used as weights of each local model.

It should be emphasized again, for the sake of the clarity and novelty of the IFC, that earlier variations of improved clustering methods, e.g., [Höppner and Klawonn, 2003; Chen et al., 1998], approximate a polynomial function for each local partition using the original input variables to estimate the output variable and append this error term as a second term in the distance function of the clustering, as shown in equation (3.21). The IFC clustering algorithm, by contrast, is designed to find membership values to be used to model the system behavior. At each step of the IFC optimization, a special regression function is estimated for each cluster. In approximating these functions, only the membership values and their user-defined transformations are used as input variables. These regression functions are called “Interim Fuzzy Functions” and are denoted $h_i(\tau_i)$, where τi is the matrix of the ith cluster, consisting of the membership values and their user defined transformations. One could estimate the parameters of these functions with a simple linear regression model such as least squares regression (LSE), or build non-linear models such as support vector regression [Gunn, 1998] or neural networks [Kosko, 1992]. The residual of the interim fuzzy functions, i.e., (yk−[ŷk=hik(τi)])², is used as additional similarity information: it is added to the distance function as a second term and consequently appears as an additional term in the objective function of the IFC algorithm. Since objective function based fuzzy clustering algorithms include a distance function as a similarity measure, the distance function should affect the behavior of the membership value calculation equation. In [Höppner and Klawonn, 2003], such an effect is not reflected in the membership value calculation equation (3.20); instead, their membership value calculation equation serves their particular purpose of finding more crisp membership values to eliminate harmonics. In this work, a new membership value calculation equation is introduced, which is derived from the modified distance function.

On the other hand, earlier fuzzy c-regression models and combined fuzzy clustering algorithms use the membership values of each cluster as weights in local regression models. Since the unique aim of the IFC algorithm is to find membership values that are better predictor arguments for the Fuzzy Functions of each local partition (cluster), the interim fuzzy functions of the IFC algorithm use only the calculated membership values and their transformations as input variables. Therefore, in the new IFC optimization approach, we do not include the original scalar inputs while shaping the membership values; we aim only at finding the membership values that best explain the output. In short, the new IFC introduces a new membership value calculation equation resulting from the addition of the second term to the distance function, and consequently to the objective function, of the IFC algorithm, whereas the earlier approaches use the original input variables to find local linear models.

Consequently, in the training algorithm of the Fuzzy Functions approach, after the hidden structures are identified with IFC, we approximate regression functions for each of the fuzzy partitions (clusters) identified by IFC, using the membership values from IFC, their transformations, and the original scalar input variables. We use linear regression methods, e.g., least squares estimation (LSE) as proposed by Turksen [2008], or non-linear regression methods, e.g., support vector machines for regression (SVR) [Gunn, 1998] as proposed by Celikyilmaz [2005]. We call these local regression functions the “Fuzzy Functions”. It is hypothesized that the modeling error of the proposed Fuzzy Function methods will be lower when the new IFC of this work is used instead of standard FCM [Bezdek, 1981a].

It should be emphasized that earlier fuzzy system modeling algorithms, e.g., Zadeh's fuzzy rule base structures [1975a], assume that expert information is available and that the linguistic descriptions (the degrees of membership) are determined subjectively. This may cause discrepancies, since expert knowledge is subjective. Later, various fuzzy models were developed that apply clustering algorithms to input-output data to find membership values; among these studies are [Sugeno and Yasukawa, 1993; Delgado et al., 1997; Emami and Turksen, 1998] and others. In [Delgado et al., 1997], different approaches are presented for the identification of fuzzy models, including fuzzy clustering of the input-output (XY) domain, viz. the Z={XY} input data space. They use FCM clustering to generate c fuzzy clusters in the Z domain, i.e., Z=[X,Y], with centers denoted υiZ=(υiX, υiY) for i=1,…,c. Hence, the membership value calculation equation of the fuzzy relation associated with the ith cluster is defined as:

$$\mu_{ik}=\left[\sum_{j=1}^{c}\left(\frac{d\big(z_k,\upsilon_i^{Z}(t)\big)}{d\big(z_k,\upsilon_j^{Z}(t)\big)}\right)^{2/(m-1)}\right]^{-1},\quad m>1\,^{5} \qquad (3.24)$$

Hence, Delgado et al. [1997] identify local relationships between inputs and outputs around centroids, which are captured by the above fuzzy relation on the XY domain, and then define membership value calculation equations for the individual domains, e.g., X and Y. Identifying the fuzzy cluster structure as two separate components of inputs and outputs does not mean that the membership value calculation equations are projections of the clusters; rather, they are induced in the X and Y spaces by the fuzzy clusters. In this way, c fuzzy sets are obtained in the X and Y spaces, denoted by υiX and υiY, with membership value calculation equations defined as:

$$\mu_{ik}^{X}=\left[\sum_{j=1}^{c}\left(\frac{d\big(x_k,\upsilon_i^{X}\big)}{d\big(x_k,\upsilon_j^{X}\big)}\right)^{2/(m-1)}\right]^{-1},\qquad \mu_{ik}^{Y}=\left[\sum_{j=1}^{c}\left(\frac{d\big(y_k,\upsilon_i^{Y}\big)}{d\big(y_k,\upsilon_j^{Y}\big)}\right)^{2/(m-1)}\right]^{-1},\quad m>1 \qquad (3.25)$$

The assumption here is that the input variables are not independent. Instead of defining separate membership functions for each individual input variable, one multi-dimensional interactive membership function is formulated to represent the whole antecedent part of a fuzzy rule, as defined in [Delgado et al., 1997], by joint (interactive) membership values. Thus the non-interactivity⁶ assumption between the components of the input space is not made, unlike in earlier fuzzy rule base approaches. Some examples of applications of clustering the input-output domain as well as the output domain can be found in [Emami and Turksen, 1998; Kemal, 2002; Uncu, 2003]. The new clustering algorithm is likewise based on input-output clustering of the given system domain and also assumes interactivity between the input variables.

⁵ t indicates any iteration.

⁶ The non-interactivity assumption indicates that the antecedent-part input variables are assumed to be independent from each other, and hence there would be no interaction whatsoever between their antecedent fuzzy sets.
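In code, the scheme of (3.24)-(3.25) amounts to clustering the concatenated matrix Z=[X, Y] and then inducing memberships in each sub-space from the split centers. A sketch, assuming inputs X and outputs Y are NumPy arrays and reusing the fcm sketch given earlier:

```python
import numpy as np

# cluster the joint input-output space Z = [X, Y], eq. (3.24);
# X: (n, nv) inputs, Y: (n,) outputs, fcm as sketched in Section 3.2.1
Z = np.column_stack([X, Y])
U_z, V_z = fcm(Z, c=2, m=2.0)

# split each joint center v^Z = (v^X, v^Y) into its sub-space parts
V_x, V_y = V_z[:, :-1], V_z[:, -1]

def induced_membership(point, centers, m=2.0):
    """Eq. (3.25): FCM-style memberships induced in a sub-space."""
    centers = np.reshape(centers, (len(centers), -1))
    d2 = np.fmax(((centers - np.ravel(point)) ** 2).sum(axis=1), 1e-12)
    r = d2 ** (1.0 / (m - 1))
    return (1.0 / r) / (1.0 / r).sum()

mu_x = induced_membership(X[0], V_x)   # memberships of x_1 in the X space
mu_y = induced_membership(Y[0], V_y)   # memberships of y_1 in the Y space
```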

The IFC algorithm finds membership values that are candidate inputs for each local fuzzy model. These fuzzy models are regression functions, where the output variable, y∈ℜ, has a continuous domain. In this work, we also extend the IFC method to classification domains (IFC-C), where each object in the dataset belongs to one of a set of predefined classes. In Chapter 4, a new fuzzy classifier design is introduced, which uses IFC-C clustering in fuzzy classifier models to solve classification problems. There, the error, i.e., (yk−ŷik)=(yk−h(τi)), is replaced with the error between the actual binary output and the posterior probability, p̂ik, of each object calculated by a chosen classification function, i.e., (yk−p̂ik). Next, the framework of the new improved fuzzy clustering (IFC) algorithm is presented for regression and classification problem domains.

3.3.2 Improved Fuzzy Clustering Algorithm for Regression Models (IFC)

The standard Fuzzy c-Means (FCM) algorithm [Bezdek, 1981a] is used in fuzzy system models to find the membership values, which are assumed to represent optimum partitions of the given dataset. In Fuzzy Functions approaches, these membership values are used as additional input variables to predict the parameters of the regression models of each cluster. In this work, we propose a new fuzzy clustering method by modifying the standard FCM algorithm. The new IFC is proposed to find membership values that improve the performance of the local models represented with fuzzy functions. The optimization approach of the new improved fuzzy clustering algorithm not only searches for the best partition of the data but also aims at increasing the predictive power of the membership values in modeling the output variable with Fuzzy Functions. Hence, we call the new fuzzy clustering algorithm “Improved Fuzzy Clustering (IFC)” [Celikyilmaz, Turksen, 2007b].

For the given multi-input, single-output system, let the data be represented as xy={(x1,y1),…,(xn,yn)}, where xk=[x1,k,…,xnv,k] is the nv-dimensional kth data vector, k=1,…,n, n is the total number of data vectors, and each yk is the corresponding output value.

First, we introduce a new objective function, which serves two purposes: (i) to find a good representation of the partition matrix; (ii) to find membership values that minimize the error of the Fuzzy Function models. To optimize the membership values, the new IFC appends the error obtained from the regression functions to the objective function as follows:

$$J_{m}^{IFC}=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}d_{ik}^{2}+\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}\big(y_k-h_i(\tau_i,\hat{w}_i)\big)^{2} \qquad (3.26)$$

In (3.26), μikimp is the improved membership value of the kth input vector in the ith cluster, i=1,…,c, and m is the degree of fuzziness parameter, which determines the overlap of the clusters. The objective function to be minimized, JmIFC in (3.26), comprises two separate terms. The first term is the same as in the standard FCM algorithm: it controls the precision of each input-output data vector with respect to its cluster, d2=||xkyk−υi(xy)||2, and vanishes when each data sample coincides with its cluster center.

The second term is the total squared error of the Interim Fuzzy Functions, hi(τi,ŵi), of each cluster, where τi is called the interim matrix and ŵi is the coefficient vector of cluster i. This term measures the squared error of the approximated user defined functions, in the sense of which membership transformations are to be included in the Fuzzy Functions built during the IFC optimization. The only input variables are the membership values and/or their possible transformations, i.e., the original scalar inputs are omitted. Therefore, the input matrix, τi(μiimp), used to estimate these Interim Fuzzy Functions, hi(τi,ŵi), i=1,…,c, utilizes the membership values from the previous iteration step. Let the input matrix of the ith cluster be composed of two-dimensional input vectors of the membership values and their log-odds transformations, τi=[μiimp log((1−μiimp)/μiimp)]. The set of planes in ℜ2 of each ith cluster is then defined as hi=ŵ0i+ŵ1iμiimp+ŵ2i log((1−μiimp)/μiimp), or hi=τiTŵi, where ŵiT=[ŵ0i ŵ1i ŵ2i] are the coefficients of the Interim Fuzzy Functions. A particular set of fuzzy functions in ℜ2 is defined by

$$\hat{y}_i=h_i(\tau_i,\hat{w}_i)=\hat{w}_{0i}+\hat{w}_{1i}\,\mu_i^{imp}+\hat{w}_{2i}\log\!\left(\frac{1-\mu_i^{imp}}{\mu_i^{imp}}\right)=\hat{w}_{0i}+\sum_{j=1}^{2}\hat{w}_{ji}\,\tau_{ji} \qquad (3.27)$$

Here ŷi is the estimated output value in the ith cluster at the tth iteration, τi is the input matrix created from the improved membership values, μiimp, of the ith cluster at the tth iteration, along with their log-odds transformations (the original scalar input variables, x, are not included), and the ŵi are the parameters of the functions estimated using a linear regression method, e.g., least squares regression. The log-odds are a user defined transformation of the membership values, commonly used in fuzzy functions because the distribution of the membership values in fuzzy clustering algorithms is mostly Gaussian (bell shaped). The distribution of the membership values is discussed in the next section.

It should be emphasized that in (3.27) the only inputs are the membership values and their user defined transformations. We want to improve the predictive power of the membership values to be used in system modeling as additional inputs. Hence, each iteration t of the IFC optimization tries to minimize the error between the actual and estimated output values. By excluding the original input variables, we are able to measure and improve the individual effect of the membership values on model performance. The distance function of the IFC algorithm is

$$d_{ik}^{IFC}=\big(z_k-\upsilon_i(z)\big)^{2}+\big(y_k-h_i(\tau_{ik},\hat{w}_i)\big)^{2} \qquad (3.28)$$

where zk∈Z={xk,yk}∈XY⊆ℜ(nv+1) denotes the input-output vector and υi(z) is the ith cluster center. For each cluster, one interim fuzzy function, hi(τi,ŵi), i=1,…,c, is approximated using the membership values from the previous iteration as input variables. The second term vanishes when the estimated interim fuzzy function of each cluster, hi(τi,ŵi), can perfectly explain the observed output variable. The trade-off between the precision term and the squared error term defines the optimal IFC model.
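One inner step of the IFC, estimating an interim fuzzy function from the memberships alone, might be sketched as follows (illustrative names; a least-squares fit on the membership/log-odds matrix of (3.27)):

```python
import numpy as np

def interim_fuzzy_function(u_i, y):
    """Fit one interim fuzzy function h_i(tau_i, w_i), eq. (3.27).
    u_i: memberships of the n samples in cluster i; y: observed outputs.
    Only the memberships and their log-odds enter the model; the
    original scalar inputs x are deliberately omitted."""
    u = np.clip(u_i, 1e-6, 1.0 - 1e-6)             # keep log-odds finite
    tau = np.column_stack([np.ones_like(u), u,
                           np.log((1.0 - u) / u)])  # [1, mu, log-odds]
    w, *_ = np.linalg.lstsq(tau, y, rcond=None)     # LSE fit of w_i
    se = (y - tau @ w) ** 2    # squared errors feeding (3.28) and (3.30)
    return w, se
```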

The solution that minimizes the new objective function, JmIFC, can be found by taking the dual of the model using the Lagrange transformation of the objective function and converting it into a maximization problem. This is done by introducing a Lagrange multiplier, λ, for the constraint $\sum_{i=1}^{c}\mu_{ik}^{imp}=1$. We re-write the objective function as follows:

$$L_m=\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}d_{ik}^{2}+\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^{m}\big(y_k-h_i(\tau_i,\hat{w}_i)\big)^{2}-\lambda\left(\sum_{i=1}^{c}\mu_{ik}^{imp}-1\right) \qquad (3.29)$$

The solution to the maximization of the convex model is found by taking the derivatives of the Lagrange function with respect to the unknown parameters, i.e., the improved membership values, μ_ik^imp, and the cluster centers, υ_i. The input variables of the regression models, h_i(τ_i), used to approximate the output variable are the membership values from the (t-1)th iteration. Therefore the second term, SE_ik^(t-1) = (y_k - h_i(τ_ik^(t-1)))², is a known quantity at the tth iteration. Hence, by taking the derivatives of the objective function in (3.29) with respect to the cluster centers and the membership values, the optimum membership values of the proposed clustering algorithm are formulated as:

\mu_{ik}^{imp(t)} = \left[\sum_{j=1}^{c}\left(\frac{\left(d_{ik}^{(t-1)}\right)^2 + \left(y_k - h_i(\tau_{ik}^{(t-1)},\hat{w}_i)\right)^2}{\left(d_{jk}^{(t-1)}\right)^2 + \left(y_k - h_j(\tau_{jk}^{(t-1)},\hat{w}_j)\right)^2}\right)^{1/(m-1)}\right]^{-1}, \quad 1\le i\le c,\ 1\le k\le n \qquad (3.30)

Since the second term of the objective function J_m^IFC in (3.26) does not include the cluster center term, it vanishes when the derivative of the objective function is taken with respect to the cluster center, υ_i(xy); the objective function then reduces to that of the standard FCM algorithm. Therefore, the cluster center update of standard FCM remains unchanged for IFC and is given by

\upsilon_i^{(t)} = \frac{\sum_{k=1}^{n}\left(\mu_{ik}^{imp(t)}\right)^m z_k}{\sum_{k=1}^{n}\left(\mu_{ik}^{imp(t)}\right)^m}, \quad \forall i,\ 1\le i\le c \qquad (3.31)

The derivation of the membership value update equation (3.30) via the Lagrange transformation is given in Appendix B.2. The membership value update in (3.30) uses the cluster centers from the previous iteration, (t-1), and the cluster center update in (3.31) uses the membership values from the current iteration. Hence, similar to the FCM algorithm, an iterative algorithm is used to find the optimum membership values of IFC. The algorithm terminates according to some termination criterion, e.g., when the total number of iterations exceeds a user-defined maximum, or when the separation between two consecutive objective function values falls below a threshold. A generalized framework of the IFC algorithm is shown in ALGORITHM 3.4.

Let the fuzzy partition at each iteration t, t=1,…,max-iter, be denoted by:

U^{(t)} = \begin{pmatrix} \mu_{1,1}^{imp(t)} & \cdots & \mu_{c,1}^{imp(t)} \\ \vdots & \ddots & \vdots \\ \mu_{1,n}^{imp(t)} & \cdots & \mu_{c,n}^{imp(t)} \end{pmatrix}

ALGORITHM 3.4 Optimization with the Improved Fuzzy Clustering Algorithm (IFC) for Regression Models

Given the training dataset, Z={(x_1,y_1),…,(x_n,y_n)}. Set m>1.1, c>1, a termination constant ε>0, and a maximum number of iterations (max-iter); specify the structure of the regression models, such as in (3.27), for each cluster i, i=1,…,c, k=1,…,n, to create the interim input matrix, τ_i. Initialize the partition matrix, U^0, using the FCM clustering algorithm. Then, for each iteration, t=1,…,max-iter:
(1) Populate c input matrices, τ_i^(t-1), one for each cluster, using the membership values U^(t-1) from the (t-1)th iteration and their selected user-defined transformations.
(2) Approximate c interim fuzzy functions, h_i(τ_ik^(t-1)), such as in (3.27).
(3) Update the membership values for iteration t using (3.30).
(4) Calculate the cluster centers for iteration t using (3.31).
(5) If (obj^(t) - obj^(t-1)) < ε, terminate; otherwise return to step (1).
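The sketch below shows one way ALGORITHM 3.4 could be realized with the least-squares interim functions of (3.27). It is an illustration under assumptions, not the authors' code: the random Dirichlet partition stands in for the FCM initialization, the log-odds feature set is fixed, and the joint data matrix Z is assumed to carry the output as its last column.

import numpy as np

def ifc(Z, y, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    # Z: (n, nv+1) joint input-output data for the distance term of (3.28).
    n = len(y)
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=n)          # stand-in for an FCM init
    W = U ** m
    V = (W.T @ Z) / W.sum(axis=0)[:, None]         # initial centers (eq. 3.31)
    prev_obj = np.inf
    for _ in range(max_iter):
        # Steps (1)-(2): one interim fuzzy function per cluster (eq. 3.27).
        se = np.empty((n, c))
        for i in range(c):
            mu = np.clip(U[:, i], 1e-6, 1 - 1e-6)
            tau = np.column_stack([np.ones(n), mu, np.log((1 - mu) / mu)])
            w_hat, *_ = np.linalg.lstsq(tau, y, rcond=None)
            se[:, i] = (y - tau @ w_hat) ** 2
        # Augmented distances of eq. (3.28).
        d2 = ((Z[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + se
        # Step (3): membership update (eq. 3.30).
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Step (4): cluster centers from the current memberships (eq. 3.31).
        W = U ** m
        V = (W.T @ Z) / W.sum(axis=0)[:, None]
        # Step (5): terminate when the objective change falls below eps.
        obj = ((U ** m) * d2).sum()
        if abs(prev_obj - obj) < eps:
            break
        prev_obj = obj
    return U, V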

IFC optimization starts with an initial partition matrix, U^0, and initial cluster centers, υ^0. One may use a crisp clustering method such as k-means, or a fuzzy clustering method such as FCM, to find U^0 and υ^0; one should take into account the trade-off between choosing a fuzzy clustering method such as FCM and a crisp method like k-means. U^0 and υ^0 are required inputs at the start of IFC, since the new membership value update equation in (3.30) requires the error terms of the regression models, (y_k - h_i(τ_i,ŵ_i))². The model output ŷ_ik = h_i(τ_i,ŵ_i) of each cluster i is estimated using only the membership values and their mathematical transformations as input variables, as in (3.27).

The IFC optimization method searches for the optimum membership values, which are later used as additional predictors to estimate the parameters of the Fuzzy Functions of the given system model. In step 2 of ALGORITHM 3.4, the membership values U^(t-1) calculated at the (t-1)th iteration are used as input variables at the tth iteration to find the parameters of the regression functions, ŵ_i, for each cluster. Any function approximator can be used to identify the parameters. In the experiments we use least squares estimation (LSE) to identify linear functions and support vector regression (SVR) [Gunn, 1998] to approximate non-linear functions [Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007b]. Since the aim of IFC is to find membership values that are good predictors of the output, the only input variables when finding the interim fuzzy functions of each cluster are the membership values and their user-defined transformations. The optimization algorithm tries to minimize the error of the regression models, h_i(τ_ik^(t-1),ŵ_i). One should choose appropriate membership value transformations to approximate the output. The effect of the improved membership values as predictors of the Fuzzy Functions is investigated in the following sub-section, entitled justification of the membership values of the IFC algorithm.

The output of the IFC algorithm for a given m and c value is:
• the parameters of the interim fuzzy functions of each cluster, ŵ_i, i=1,…,c, captured from the last iteration step,
• the improved membership value matrix, U(z), and the cluster centers, υ(z),
• the interim input matrix structure, τ_i, composed of the membership values and their transformations for each cluster i.

The fuzzy function systems utilize the membership values from IFC as additional predictors. Hence, the new IFC algorithm is methodically designed to improve the performance of the membership values when they are used as additional predictors in Fuzzy Functions systems.

3.3.3 Improved Fuzzy Clustering Algorithm for Classification Models (IFC-C)

In this section, an extension of the improved fuzzy clustering (IFC) algorithm to classification models is presented. In the previous section, the novel IFC algorithm was introduced for fuzzy system models with Fuzzy Functions that solve regression problems. To adapt the IFC algorithm to classification problems (IFC-C), one needs to change the way the decision function, namely the Fuzzy Function parameters, is calculated. It should be noted that, for regression problems, the Interim Fuzzy Functions, h_i^(t)(τ_i,ŵ_i), of each cluster (at each step of the IFC iteration) try to model a continuous output variable. For classification problems, however, the output variable is dichotomous, e.g., y∈{0,1} or y∈{-1,+1}, or ordinal/discrete, y∈{0,1,2,…}. One needs to implement a classifier function, e.g., logistic regression [Allison, 2001], support vector machines for classification [Gunn, 1998], or neural networks for classification [Kosko, 1992], in order to assign class labels to each data point. In this work we only deal with classification problems where the output variable is dichotomous, e.g., y∈{0,1}, y∈{-1,+1}. The choice of classification method is usually a trade-off between the complexity (non-linearity) and the generalization capacity of the classifier.

The second term of the IFC objective function in (3.26) is the squared error between the output of an interim fuzzy function and the actual output of each data point. Hence, one first estimates the output values, ŷ_ik, of each data point k in cluster i using the interim fuzzy functions of the ith cluster. Then the error between ŷ_ik and the observed output values, i.e., SE_ik = (y_k - ŷ_ik)², is measured. For classification functions, the output is a binary variable, which only takes the values 0 or 1. One uses a classification method, e.g., logistic regression (LR), support vector machines for classification (SVC), or neural networks for classification (NNC), to calculate a decision function, h(τ_i), so that the sign of the decision function, sign(h_k(τ_ik)), is the predicted output of object k in cluster i. Instead of predicting the label, sign(h_k(τ_ik)), which is a binary outcome, one can calculate the posterior probability of the output, p̂_ik(y_k=1|τ_ik) ∈ [0,1], of each data point in each cluster, i.e., the probability that the output is (y_k=1). The estimated probability indicates how likely the output of the given observation is to be 1. It is a scalar variable, and one can form a performance measure between the actual normalized output values of each object and the estimated posterior probability, such as (y_k - p̂_ik)². Then, we append the error of these classification functions onto the objective function of the standard FCM clustering to formulate the new hybrid structure of IFC-C as follows:

J_m^{IFC\text{-}C} = \underbrace{\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^m d_{ik}^2}_{FCM} + \underbrace{\sum_{i=1}^{c}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^m\left(y_k - \hat{p}_{ik}(y_k=1\,|\,\tau_{ik},\hat{w}_i)\right)^2}_{SE\ of\ Fuzzy\ Classifier\ Function} \qquad (3.32)

The objective function to be minimized in (3.32), J_m^IFC-C, also has a dual structure. The first term is the same as the objective function of the standard FCM clustering. The second term is the total squared error of the "Interim Fuzzy Classification Function", h(τ_ik,ŵ_i), of each cluster i, where τ_i is the corresponding cluster's input dataset, i=1,…,c, k=1,…,n, and the ŵ_i are the interim fuzzy classification function parameters. This term measures the squared deviation of the actual class labels, e.g., y_k∈{0,1}, from the estimated posterior probabilities, p̂_ik(y_k=1|h(τ_i,ŵ_i)).

For any cluster i, the second term takes the form:

SE_i = \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} \hat{p}_{i1}(y_1=1) \\ \vdots \\ \hat{p}_{in}(y_n=1) \end{bmatrix} \right\|^2

The IFC-C algorithm tries to minimize the error of the classifier functions, i.e., to push the estimated posterior probabilities, p̂_ik(y_k=1), toward 1 for data points whose actual label is 1. Thus, during each iteration step, the membership values are reshaped so that they predict the correct class labels with the Interim Fuzzy Classifier Functions while still representing the optimum fuzzy partition of the given dataset. By excluding the original input variables from function estimation, we are able to measure and improve the individual effect of the membership values on the performance of each classifier model. If the estimated function can separate the two classes with the given inputs, the error will be smaller and the algorithm will converge faster. Since we want to find membership values that can improve classification accuracy, during IFC-C we only use the corresponding cluster's membership values and their transformations as input variables to estimate the classifier functions, h(τ_i,ŵ_i).

The effect of finding better classifiers by shaping the membership values can be explained with an example. During the tth iteration of the IFC-C optimization, let τ_ik and τ_ik′ represent two vectors, k and k′, randomly selected from the dataset τ_i, with actual output values y(τ_ik)=1 and y(τ_ik′)=0, respectively. Suppose an Interim Fuzzy Classifier Function, h(τ_i), estimates a posterior probability of p̂_ik=0.90 for τ_ik. The squared error for this kth vector would be SE_ik=(1-0.90)²=0.01. This indicates that the interim fuzzy classifier function, h(τ_ik), predicts that the output value of the kth input vector is 1 with 90% probability, which is quite accurate, so its effect on the objective function will be quite small. Since we are trying to find the global minimum of the objective function, this means fast convergence. Similarly, for the second vector k′, let the fuzzy classifier predict a probability of p̂_ik′=0.15, which indicates that it is very unlikely that the label of vector k′ is 1. The squared error would be (0-0.15)²=0.0225. In conclusion, the better the classifier functions separate the two classes, the smaller the objective function, J^IFC-C, gets and, as a result, the faster IFC-C converges.

Depending on the non-linearity of the given dataset, one could use user-defined transformations of the membership values, such as the exponential transformation, exp(μ_ik^imp), or a power transformation, (μ_ik^imp)^p, p∈Z. One could use a suitable statistical learning algorithm, e.g., logistic regression (LR), or a more effective soft computing approach such as support vector classification (SVC), to approximate the fuzzy classifiers. In the LR case, for instance, the posterior probability of each data point in cluster i is calculated by

\hat{p}_{ik}(y_k=1\,|\,\tau_{ik}) = \frac{1}{1+\exp\left(-\,w_i^T\tau_{ik}\right)}, \qquad \tau_{ik} = \begin{bmatrix} 1 \\ \mu_{ik}^{imp} \\ e^{\mu_{ik}^{imp}} \\ \vdots \end{bmatrix} \in \Re^{nm},\ \ w_i = \begin{bmatrix} w_{i,0} \\ w_{i,1} \\ \vdots \end{bmatrix} \in \Re^{nm+1} \qquad (3.33)

where τ_ik denotes the input vector, the constituents of which are the membership values and their user-defined transformations, and w_i are the LR parameters to be approximated. If support vector classification (SVC) is used to estimate the fuzzy classifier function parameters, the output values of the data vectors are calculated by

h_i(\tau_i) = \sum_{k}\beta_{ik}\,y_k\,K(\tau_{ik},\tau_i^s) + b_i, \quad i=1,\dots,c,\ k=1,\dots,n \qquad (3.34)

In (3.34), K(·) is the kernel function that maps the original interim data vectors, τ_i, onto a higher dimensional space, either linearly or non-linearly. In this work we analyzed two different kernel functions: the linear kernel, K(τ_ik,τ_il) = τ_ik^T τ_il, where τ_ik and τ_il are interim vectors of nm dimensions holding the membership values and their user-defined transformations, and the probabilistic kernel, i.e., the radial basis function (RBF), K(τ_ik,τ_il) = exp(-δ||τ_ik - τ_il||²), δ>0. The β_ik in (3.34) are Lagrange multipliers, one for each interim vector, introduced to solve the following SVC optimization problem for each local model i (cluster):

\max_{\beta_i}\ Q(\beta_i) = \sum_{k=1}^{n}\beta_{ik} - \frac{1}{2}\sum_{k,l=1}^{n}\beta_{ik}\beta_{il}\,y_k y_l\,K(\tau_{ik},\tau_{il})
\quad \text{s.t.}\ \sum_{k=1}^{n}\beta_{ik}y_k = 0,\ \ 0\le\beta_{ik}\le C_{reg},\ \ k,l=1,\dots,n,\ i=1,\dots,c \qquad (3.35)

C_reg is the regularization constant, which balances the complexity of the machine against the number of separable data vectors. Interim vectors with non-zero coefficients (Lagrange multipliers), β_ik>0, are called "support vectors", denoted τ_ik^s for each cluster i. The fewer the support vectors, the better the generalization capacity of such models. A brief summary of support vector machines for classification (SVC) is given in Appendix C.2.

The output value obtained from each SVC fuzzy function in (3.34) is a scalar that needs to be transformed into a posterior probability. We measure the posterior probabilities using the improved Platt's probability method [Platt, 2000; Lin et al., 2003], which approximates the model output labels with a sigmoid function as follows:

\hat{p}_{ik}(y_k=1\,|\,h_i(\tau_i)) = \left(1+\exp\left(a_1 h_i(\tau_i) + a_2\right)\right)^{-1} \qquad (3.36)

The parameters a_1 and a_2 are found by minimizing the negative log-likelihood of the training data:

\min_{a_1,a_2}\ -\sum_{k=1}^{n}\left(\frac{1+y_k}{2}\log(\hat{p}_{ik}) + \left(1-\frac{1+y_k}{2}\right)\log(1-\hat{p}_{ik})\right) \qquad (3.37)

The output values obtained from the SVC interim fuzzy functions are thus represented not as class labels, i.e., y_k∈{-1, 1}, but as estimated posterior probabilities.
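A minimal sketch of the sigmoid fit in (3.36)-(3.37) follows, assuming decision values f and labels y in {-1,+1} are already available from an SVC. The optimizer choice (scipy's Nelder-Mead) and the clipping constant are assumptions of this illustration, not part of the method's definition.

import numpy as np
from scipy.optimize import minimize

def platt_scale(f, y):
    # t_k = (1+y_k)/2 maps labels {-1,+1} to target probabilities {0,1}.
    t = (1.0 + y) / 2.0

    def neg_log_lik(a):
        p = 1.0 / (1.0 + np.exp(a[0] * f + a[1]))        # eq. (3.36)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)               # numerical safety
        return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))  # eq. (3.37)

    a1, a2 = minimize(neg_log_lik, x0=[-1.0, 0.0], method="Nelder-Mead").x
    return 1.0 / (1.0 + np.exp(a1 * f + a2))             # posterior estimates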

It should be noted that the fuzzy classifier functions estimated at each iteration, h(τ_i), are only used for estimating and structuring the membership values during the IFC-C optimization algorithm, which excludes the original input variables, x∈ℜ^nv. The membership value update equation is re-formulated for IFC-C according to these objectives as follows:

\mu_{ik}^{imp} = \left[\sum_{j=1}^{c}\left(\frac{d^2(x_k,\upsilon_i) + \left(y_k - \hat{p}_{ik}\right)^2}{d^2(x_k,\upsilon_j) + \left(y_k - \hat{p}_{jk}\right)^2}\right)^{1/(m-1)}\right]^{-1}, \quad 1\le i\le c,\ 1\le k\le n \qquad (3.38)

Generalized steps of the Improved Fuzzy Clustering algorithm for classification models (IFC-C) are displayed as follows:

Step 1: Initialize the clustering parameters: m (fuzziness parameter) and c (number of clusters), and choose a termination threshold, ε>0.
Step 2: Initialize the prototypes, υ_i∈V, and the membership values, μ_ik^imp∈U⊂ℜ^(n×c).
Repeat
  Update the membership values using (3.38),
  Update the prototypes using (3.31),
Until the change in the objective function in (3.32) is less than ε.
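The membership update of (3.38) is a small computation once the cluster distances and posteriors are in hand. The sketch below is illustrative only; it assumes arrays d2 (squared distances to each prototype) and p_hat (per-cluster posterior estimates, e.g., from platt_scale above) as hypothetical inputs.

import numpy as np

def ifc_c_memberships(d2, p_hat, y, m=2.0):
    # d2, p_hat: (n, c) arrays; y: (n,) array of 0/1 labels.
    aug = d2 + (y[:, None] - p_hat) ** 2           # augmented distance (eq. 3.38)
    ratio = (aug[:, :, None] / aug[:, None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)                 # each row sums to one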

In Step 2 of the IFC-C algorithm, one may use FCM clustering [Bezdek, 1981a] or a crisp clustering method such as k-means to find the initial membership values and cluster prototypes. The IFC-C optimization method searches for the optimum membership values, which are to be used as additional predictors to estimate the parameters of the Fuzzy Classifier Functions of the given system. Thus, IFC-C is implemented in the structure identification and inference phases of the new Improved Fuzzy Functions method (to be presented in Chapter 4). The outputs of the IFC-C algorithm for any given set of {m,c} values are as follows:
(i) the optimal parameters, ŵ_i, i=1,…,c, of the fuzzy classifier function of each cluster, h(τ_i,ŵ_i), which is used to calculate the posterior probabilities, p̂_i(y=1|h(τ_i,ŵ_i)), captured from the last iteration step of IFC-C,
(ii) the structure of τ_i, in other words the list of the different types of membership value transformations that were used to approximate each h(τ_i,ŵ_i) in IFC-C.

Depending on the type of function approximation method used, the parameters, ŵ_i, represent different values. For instance, when support vector classification is used, the parameter list of each cluster i, ŵ_i, comprises the Lagrange multipliers, β_ik, of the support vectors and the support vectors themselves, τ_i^S, of each cluster, as shown in (3.34). On the other hand, when LR is used, the parameter list is just the coefficients of each membership value feature (as a column vector) used to approximate the fuzzy classifier functions, as shown in (3.33). Next, we explain why IFC can be a better method than standard FCM clustering for fuzzy function systems.

3.3.4 Justification of Membership Values of the IFC Algorithm

“Are membership values obtained from IFC algorithm better predictors than membership values from FCM clustering algorithm?”

The Improved Fuzzy Clustering (IFC) algorithm for regression and classification problems is a modification of two well-known fuzzy clustering algorithms, i.e., FCM and FCRM. It searches for the optimum membership values, which are later used as additional predictors of the "Fuzzy Functions" systems [Turksen, 2007; Celikyilmaz, 2005; Celikyilmaz and Turksen, 2007b,c].

During the structure identification of the novel fuzzy functions approach, the FCM clustering or the IFC algorithm is used to partition the data into clusters and assign one membership value to every datum for each of these clusters. For each cluster a separate dataset is formed, and using these datasets one Fuzzy Function, i.e., a regression or classification function, is identified per cluster. The membership values are then used as additional predictors in these Fuzzy Functions. The assumption is that the membership values obtained from the novel IFC, used as additional predictors of the Fuzzy Functions, can increase the performance of the models more than the membership values obtained from the FCM clustering algorithm. In this section, the strengths of the membership values from the FCM clustering and the IFC algorithm are compared using a simple statistical test. It should be noted that the best comparison is made when the two algorithms are applied to real-life or benchmark datasets and the test-case performances of the models are compared. Such comparisons are presented in the analysis of experiments chapter (Chapter 6). Here we demonstrate the performance of the FCM clustering and IFC models using a small artificial dataset.

Generally speaking, membership value calculation equations are used to obtain the membership values of input vectors or output singletons, indicating to what degree an object belongs to a cluster. They can be functions of the inputs, the outputs, or both. Membership functions can take very different shapes; some common shapes are triangular, Gaussian, trapezoidal, and singleton, as shown in Figure 3.2.

Fig. 3.2 Types of Membership value calculation equations

In our applications, we use fuzzy clustering methods, which are based on distance measures between each data point and each cluster center. Since the algorithm minimizes an objective function based on a distance measure, the membership values, after appropriate curve fitting of their scatter diagram, can be depicted as bell-shaped or s-shaped functions of the input variables or the output variable for each cluster. It should be emphasized that such pictorial representations show idealized (possibly curve-fitted) membership functions; in fact the membership values ought to be represented with a scatter diagram.

During the IFC algorithm, on the other hand, the Interim Fuzzy Functions of each cluster, h(τ_i,ŵ_i), try to explain the relationship between the membership values (and their transformations) of a particular cluster and the output value; the membership values become the input variables that explain the output variable through a linear or non-linear function. Therefore, the relationship between the membership values and their transformations and the output variable to be approximated should be roughly the inverse of the known membership shapes, as explained in Table 3.2. This table shows possible examples of relationships between the membership values, as input variables, and an output variable; "bell-shaped" and "s-shaped" distributions are inverted to draw the membership value versus output variable graphs. When one wants to estimate a user-defined linear function for each cluster, one way to obtain good approximations is to transform the membership values and then use them as input variables. We assume that the relationship between the membership values and their transformations and the output variable to be estimated is approximately the inverse of the known membership shapes shown in Table 3.2.

Table 3.2 Membership values as input variables in Fuzzy Function parameter estimations

Possible function | Membership value transformation | Membership value as an input variable versus output variable
S-function-1: f(y) = 1/(1+e^{-(a+by)}) | Inverse s-function-1: ŷ ≅ f^{-1}(u) ≅ -ln((1-u)/u) | (plot of membership u versus output y)
S-function-2: f(y) = 1/(1+e^{(a+by)}) | Inverse s-function-2: ŷ ≅ f^{-1}(u) ≅ ln((1-u)/u) | (plot of membership u versus output y)
Power function: f(y) = ay^{-b} | Inverse power function: ŷ = f^{-1}(u) = (u/a)^{-1/b}, a>0, b>0 | (plot of membership u versus output y)

Here, we want to discuss and demonstrate the significance of the membership values obtained from the IFC algorithm, viz., the confidence with which the membership values, used as input variables, can approximate an output variable with a function f_i: μ_i(x)→y. In the experiment presented next, we use statistical significance tests on each function of each cluster to justify the performance of a membership value calculation equation. An artificial dataset of 50 samples with a single input, x, and a single output, y, as shown in Figure 3.3, is used in this experiment. We want to determine whether there are local linear relationships between the response variable y and the independent variables, viz., the membership values of each data point in each cluster and their possible transformations.

Fig. 3.3 Scatter plot of the artificial dataset (x-input versus y-output)

When the standard deviations are not known, the best way to test whether there is a significant relationship between the dependent and independent variables of a system model is the F-statistic. The critical F_{α,nm,n-2} value for the 1-α confidence level (α∈{1%,5%,10%}) is obtained from statistical tables. The null hypothesis, H_0: F_{cluster_i} ≤ F_{critical}, i=1,…,c, states that there is no significant relationship between the membership values and their transformations and the output variable of the specified cluster. Rejection of the null hypothesis, H_0, because H_1: F_{cluster_i} > F_{critical}, states that at least one of the independent variables contributes significantly to the model (nm: number of variables). An alternative way to test significance is to use the probability, p, obtained from the regression results; p is compared to the significance level, α. If p<α, the null hypothesis is rejected and we conclude that the model has significant explanatory power; otherwise we fail to reject the null hypothesis. In this small experiment, we examine both the F and the p values.
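As an illustrative sketch of this significance test (not the authors' code), the following computes the regression F-statistic and its p-value for one cluster's interim features; scipy's F distribution supplies the p-value, the design matrix mirrors the log-odds features used above, and the residual degrees of freedom follow the standard n-nm-1 convention.

import numpy as np
from scipy.stats import f as f_dist

def regression_f_test(mu, y, alpha=0.01):
    # Design matrix: intercept, membership value, log-odds transformation.
    mu = np.clip(mu, 1e-6, 1 - 1e-6)
    X = np.column_stack([np.ones_like(mu), mu, np.log((1 - mu) / mu)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    nm = X.shape[1] - 1                     # number of predictors (here 2)
    dof = len(y) - nm - 1                   # residual degrees of freedom
    ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
    F = (ssr / nm) / (sse / dof)
    p = f_dist.sf(F, nm, dof)               # upper-tail probability
    return F, p, p < alpha                  # reject H0 if p < alpha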

In this experiment, the standard FCM clustering and the IFC algorithm are applied to the artificial dataset using five clusters, c=5, and m=2.0. The membership values obtained from the FCM and IFC models are used as independent variables to identify one "Fuzzy Function" per cluster. We used the membership values and their logistic transformations as independent variables, i.e., f_i(μ_i) = ŵ_{0,i} + ŵ_{1,i}μ_i + ŵ_{2,i}[±ln((1-μ_i)/μ_i)], where the ŵ_{ij}, j=1,…,nm, are the coefficients, to identify linear functions for the FCM clustering and IFC models, five functions, one for each cluster. The critical F_{α,nm,n-2} value for the 1-α=99% confidence level (n-2=50-2=48 and nm=2), which is F_{critical} = F_{0.01,2,48} = 3.20, is used to test the significance of each model.

test the significance of each model. Figure 3.4 shows five fitted regression functions (black surfaces) as linear

hyper-surfaces, one for each cluster. The independent variables, x-axis and y-axis, are membership values from FCM clustering method and their logistic transformations, respectively. The dependent variable, y-output, is the z-axis. Figure 3.5 is a similar graph, but this time the membership values are obtained from the IFC method. The white surfaces indicate the actual observed decision surfaces, and the black surfaces are the modeled hyper-surfaces using linear functions, fi. The two surfaces, white and black, in each figure can be explained as follows:

Actual Decision Surfaces (White Surfaces) in Figure 3.4 and Figure 3.5 are plotted using the membership values obtained from FCM and IFC, respectively, and the actual output variable. They represent the actual decision surfaces (the actual relationship between the membership values and the output variable). Since FCM and IFC have different membership value calculation equations, we expect the actual decision surfaces of FCM in Figure 3.4 to differ from those in Figure 3.5. It should be recalled that this difference between FCM clustering and IFC arises because, during the IFC optimization, the shapes of the membership values are forced to explain the output variable, e.g., in this experiment using interim linear functions of the type f_i(μ_i) = ŵ_{0,i} + ŵ_{1,i}μ_i + ŵ_{2,i}[±ln((1-μ_i)/μ_i)], ŵ_{ij}, j=1,…,nm, i=1,…,c. Since we chose a linear function to represent them, we expect the white decision surfaces of IFC to be flatter (more linear) than the decision surfaces of the FCM clustering models. One can observe from Figure 3.4 and Figure 3.5 that the actual decision surfaces (white surfaces) of the IFC models are flatter than those of the FCM clustering models in most clusters, i.e., clusters 1, 2, and 5. We can also measure the linearity of the inputs and the outputs (actual decision surfaces) using correlation analysis.

In Table 3.3, x1 represents the first input variable, the membership values, μ_i, and x2 represents the second input variable, the logistic transformation of the membership values, ±ln((1-μ_i)/μ_i). Each cell in the table represents the root-mean-square correlation of each model:

Table 3.3 Correlation Analysis of FCM clustering and IFC membership values with the output variable

        x1 vs. Output   x2 vs. Output
FCM     0.18            0.21
IFC     0.41            0.32

c^{FCM}(x_1,y) = \left(\frac{1}{c}\sum_{i=1}^{c}\mathrm{corr}^2(x_{1,i},y)\right)^{1/2} = 0.18, \quad x_{1,i}:\ \mu_i,\ \ y:\ \text{output variable}

c^{FCM}(x_2,y) = \left(\frac{1}{c}\sum_{i=1}^{c}\mathrm{corr}^2(x_{2,i},y)\right)^{1/2} = 0.21, \quad x_{2,i}:\ \ln((1-\mu_i)/\mu_i)

c^{IFC}(x_1,y) = \left(\frac{1}{c}\sum_{i=1}^{c}\mathrm{corr}^2(x_{1,i},y)\right)^{1/2} = 0.41, \quad x_{1,i}:\ \mu_i^{imp}

c^{IFC}(x_2,y) = \left(\frac{1}{c}\sum_{i=1}^{c}\mathrm{corr}^2(x_{2,i},y)\right)^{1/2} = 0.32, \quad x_{2,i}:\ \ln((1-\mu_i^{imp})/\mu_i^{imp})

where μ_i^imp represents the improved membership values obtained from the IFC and μ_i represents the membership values obtained from the FCM clustering. For instance, c^FCM(x_1,y) represents the overall correlation between the membership values, x_{1,i}=μ_i, and the output, y, over the five clusters obtained from the FCM clustering algorithm, and c^FCM(x_2,y) is the overall correlation between the transformed membership values and the output, y, over the five clusters obtained from the FCM clustering algorithm. Similarly, c^IFC(x_1,y) represents the overall correlation between the membership values and the output, y, over the five clusters obtained from the IFC algorithm, and c^IFC(x_2,y), x_{2,i}: ln((1-μ_i^imp)/μ_i^imp), is the overall correlation between the transformed membership values and the output, y, over the five clusters obtained from the IFC algorithm. It can be observed from Table 3.3 that the correlations between the output variable and the membership values and their transformations obtained from the IFC are higher than those of the FCM clustering. We therefore expect that, even with simple methods such as LSE, we can model the output better with the IFC.
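The root-mean-square correlation used in Table 3.3 is straightforward to compute. The sketch below is offered as an illustration; it assumes U is an (n, c) membership matrix and uses numpy's corrcoef for the per-cluster correlations.

import numpy as np

def rms_correlation(U, y):
    # Root-mean-square of per-cluster correlations between a feature
    # (here the raw membership value of each cluster) and the output y.
    corrs = [np.corrcoef(U[:, i], y)[0, 1] for i in range(U.shape[1])]
    return np.sqrt(np.mean(np.square(corrs)))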

Model (Predicted) Decision Surfaces (Black Surfaces), as shown in Figure 3.4 and Figure 3.5, are plotted using the membership values obtained from the FCM and IFC, respectively, and the output variable predicted by each function. They represent the predicted (model) decision surfaces. The F-test and the probability test on the predicted functions are discussed below.

In these five experimental trials, when the FCM clustering algorithm is used to find the membership values, only one of the five fuzzy functions (estimated decision surfaces) yields an F-test value greater than the critical value; see Table 3.4. Hence, for most clusters we fail to reject the null hypothesis, and we conclude that, for this dataset, the membership values from standard FCM clustering cannot explain the output variable.

Fig. 3.4 (White Surface) Actual decision surface using membership values (from the FCM clustering); (Black Linear Surface) the estimated linear decision surface. 'u' indicates the membership values. Axes: u, log((1-u)/u), output-y. Panels: FCM-cluster 1 (F=2.24, p=0.12), FCM-cluster 2 (F=0.59, p=0.56), FCM-cluster 3 (F=3.55, p=0.037), FCM-cluster 4 (F=1.41, p=0.25), FCM-cluster 5 (F=1.69, p=0.19).

In the same experimental trial, on the other hand, when we analyze the membership values obtained from the IFC algorithm (Figure 3.5), in 3 out of 5 clusters the membership values explain the output variable much better, since their F-test values are greater than the critical value; see Table 3.4. The majority of the IFC models (3 out of 5) pass the F-significance test. Therefore, we reject the null hypothesis and conclude that the IFC models can model the output variable better than the FCM clustering algorithm for this dataset.

Fig. 3.5 (White Surface) Actual decision surface using membership values (from the IFC); (Black Linear Surface) the estimated linear decision surface. 'u' indicates the improved membership values. Axes: u, log((1-u)/u), output-y. Panels: IFC-cluster 1 (F=16.68, p<0.0001), IFC-cluster 2 (F=4.61, p=0.01), IFC-cluster 3 (F=0.37, p=0.69), IFC-cluster 4 (F=2.39, p=0.1), IFC-cluster 5 (F=13.94, p<0.00001).

The results from this artificial dataset show that the IFC models in most cases, three out of five, can find optimum fuzzy partitions of the input space. Using membership values from the IFC as predictors, one can in general approximate the output variable better than with models constructed from membership values obtained from standard FCM clustering. One should run further experiments, using non-linear models, trying different membership value transformations, or using different values of m and c, before coming to a more definitive conclusion. Even though this would be an exhaustive procedure that may take time, one may be able to obtain the optimum solution. Nonetheless, here we wanted to show that, even with simple regression models, the IFC models can be more powerful estimators than the FCM clustering results in function estimation problems. In this example, it is shown that the membership values do not display a significant relationship with the output when FCM clustering is used to obtain them.

Table 3.4 Significance test results of fuzzy functions using membership values obtained from FCM clustering and IFC

            FCM                    IFC
            F-value*   p-value     F-value   p-value
Cluster-1   2.24       0.12        16.68     0.0001
Cluster-2   0.59       0.56        4.61      0.01
Cluster-3   3.55       0.037       0.37      0.69
Cluster-4   1.41       0.25        2.39      0.1
Cluster-5   1.69       0.19        13.94     0.0001

* F_{critical} = F_{0.01,2,48} = 3.20 is the critical value.

3.4 Two New Cluster Validity Indices for IFC and IFC-C

Fuzzy clustering methods, including the Improved Fuzzy Clustering (IFC) algorithm [Celikyilmaz and Turksen, 2007b], assume that some initialization parameters are known prior to model execution. This is usually problematic, since different parameters may produce different results, which could eventually affect system performance. The literature indicates that many cluster validity functions have been proposed to validate the underlying assumption on the number of clusters, especially for the FCM clustering approach [Bezdek, 1981a]. Among the well-known validity functions, those proposed by [Fukuyama and Sugeno, 1989; Xie and Beni, 1991; Pal and Bezdek, 1995; Bezdek, 1976] are the most commonly used FCM clustering validation measures. In later years, many variations of these functions, e.g., [Boguessa et al., 2006; Dave, 1996; Kim et al., 2003; Kim and Ramakrishna, 2005], were presented by modifying or extending the earlier validity functions. These validity functions measure characteristics of a point-wise fuzzy clustering method, i.e., the FCM clustering algorithm. They are limited in determining the best clustering structure, but they can provide some information about the underlying structure of the membership values. The main characteristic of these validity functions is that they all use within-cluster distances, between-cluster distances, or both as a way of assessing a clustering schema. The within-cluster distances are interpreted as the compactness and the between-cluster distances as the separability of the clustering structure [Kim et al., 2003]. These validity functions, to be summarized in the following sections, can be categorized into two groups: ratio type and summation type [Kim et al., 2003]. The type of a validity function is determined by the way it combines the within-cluster and between-cluster distances.

As stated earlier, most validity indices are designed to validate the FCM clustering algorithm [Bezdek, 1981a]; they use characteristics of FCM to indicate the optimum number of clusters. In this sense, earlier validity indices designed for the FCM clustering method may not be suitable for other variations of fuzzy clustering algorithms designed for different purposes, e.g., the Fuzzy C-Regression (switching regression) algorithm (FCRM) [Hathaway and Bezdek, 1993]. For these variations of the FCM clustering algorithm, different validity measures have been introduced. For instance, in [Kung and Lin, 2004] a new validity index is proposed to identify the optimum number of clusters in FCRM applications. Their validity function is a modification of the Xie-Beni [1991] ratio-type validity function. It accounts for the similarity between regression models, i.e., the regression equations estimated for each cluster, using the standard inner product of unit normal vectors instead of the distance between cluster centers.

Validity functions should be designed based on the objectives and structure of the corresponding fuzzy clustering methods. In this work, two new fuzzy clustering methods are proposed, i.e., IFC and IFC-C; therefore, in this section we investigate two new cluster validity functions to determine the optimum number of clusters for models of these new clustering algorithms. The validity functions discussed below are ratio-type indices, which measure the ratio between compactness and separability. Since the IFC and IFC-C algorithms are new types of clustering methods, which combine two different fuzzy clustering approaches, i.e., the FCM clustering [Bezdek, 1981a] and FCRM [Hathaway and Bezdek, 1993] methods, in a novel way and utilize "Fuzzy Functions", the new validity indices are designed to validate two different concepts in the following way. The compactness combines the within-cluster distances and the errors between the actual and estimated outputs obtained from the c regression functions. The separability, on the other hand, determines the structure of the clusters by measuring the ratio of cluster center distances to the angle between their regression functions. If two functions of different clusters happen to be parallel to each other, only the cluster center distances are used as the separability measure.

In the next section, well-known cluster validity indices that are closely related to the proposed validity measures are reviewed, and the new cluster validity measures are introduced. Then, using four different artificial datasets and two real-life benchmark datasets, the performance of the new validity function for IFC is compared to three other well-known cluster validity measures that are closely related to the new validity measure.

3.4.1 Overview of Well-Known Cluster Validity Indices

The literature contains numerous cluster validity formulas (indices) designed for different clustering methods. This section presents validity formulas designed for two different types of fuzzy clustering algorithms: point-wise clustering, e.g., FCM clustering [Bezdek, 1981a], and regression-type clustering, e.g., fuzzy c-regression model (FCRM) type algorithms [Hathaway and Bezdek, 1993]. The research indicates that most of the prominent cluster validity indices are designed to find the optimum number of clusters for FCM algorithms. In a more recent study, Kung and Lin [2004] proposed a new validity index for identifying the optimum number of clusters of the fuzzy c-regression model (FCRM) clustering algorithm. Given that the new IFC algorithm [Celikyilmaz and Turksen, 2007b] combines regression and clustering concepts, we hypothesize that the new validity index should measure concepts from both types of cluster validity indices when indicating the optimum number of clusters. Next, we investigate both types of validity measures from the literature before presenting the new cluster validity indices.

Well-Known CVIs for Point-wise and Regression Type Clustering Algorithms

Most cluster validity functions are structured by combining two different clustering concepts [Kim and Ramakrishna, 2005]:

• Compactness: measures the similarity between cluster elements within each cluster. Most validity indices use within-cluster distances to measure compactness.

• Separability: measures the dissimilarity between individual clusters. Most validity measures use between-cluster center distances to measure separability.

It has been shown that a clustering algorithm is effective when compactness is small and separability is large [Kim and Ramakrishna, 2005]. Based on the way these two concepts are combined, cluster validity indices can be categorized into two types: ratio-type and summation-type. Ratio-type validity indices measure the ratio of compactness to separability, e.g., the Xie-Beni [1991] index. Summation-type validity indices combine the two concepts by adding them in various ways, e.g., the Fukuyama-Sugeno [1989] index. Since the new validity index proposed here is a ratio-type measure, we give details of prominent ratio-type validity indices from the literature. A well-known ratio-type cluster validity index (compactness/separability) is the XB cluster validity index [Xie and Beni, 1991], formulated as:

XB(c) = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^2\,d^2(x_k,\upsilon_i)\,/\,n}{\min_{i\ne j} d^2(\upsilon_i,\upsilon_j)}, \qquad d(x_k,\upsilon_i) = \left\|x_k-\upsilon_i\right\| \qquad (3.39)

where x_k∈ℜ^nv represents the kth input vector, k=1,…,n, and υ_i∈ℜ^nv, i,j=1,…,c, represents a cluster center as a vector of nv dimensions. XB decreases monotonically as c approaches the total number of data samples, n. Kim and Ramakrishna [2005] discuss the relationship and behavior of compactness and separability between clusters obtained from the FCM clustering method as the number of clusters changes. In [Kim et al., 2003; Kim and Ramakrishna, 2005], the relationship between compactness and separability is demonstrated using graphs; Figure 3.6 is adopted from [Kim and Ramakrishna, 2005]. According to their generalization, compactness increases sharply as c decreases from c_optimal to c_optimal-1. This means that for c<c_optimal compactness will be large and for c>c_optimal compactness will be small. Compactness is zero when each object is a cluster of itself, i.e., c=n. A sudden drop in compactness is thus an indicator of c_optimal.

Fig. 3.6 Compactness and separation concepts of a ratio-type CVI index (compactness and separability versus the number of clusters, with c-optimal marked)

From the analysis of FCM clustering results, one observes that each cluster may have a different compactness value because some clusters are denser than others. As the number of clusters is increased or decreased, the change in compactness of those clusters will be different (bigger or smaller) than for the rest of the clusters. In the XB validity index, the compactness of the overall clustering structure (the numerator in (3.39)) is determined by averaging the compactness of every cluster. However, averaging might suppress the effect of large changes in the compactness values of some clusters. These changes are usually caused by FCM clustering models with an undersized (or oversized) number of clusters. Therefore, to exploit these large compactness shifts when determining the optimum number of clusters, it is better to characterize the compactness of a model by the maximum compactness over the clusters. On the other hand, Figure 3.6 shows that the relative changes of compactness and separability are somewhat similar when c≠c_optimal; therefore, their effects on ratio-type validity indices should also be similar, i.e., they should both show increasing/decreasing behavior at the same c values. In light of this discussion, [Kim and Ramakrishna, 2005] proposed an improved version of the XB validity index as follows:

XB^*(c) = \frac{\max_{i=1,\dots,c}\left\{\sum_{k=1}^{n}\mu_{ik}^2\left\|x_k-\upsilon_i\right\|^2/\,n\right\}}{\min_{i\ne j}\left\|\upsilon_i-\upsilon_j\right\|^2} \qquad (3.40)

It was shown in [Kim and Ramakrishna, 2005] that the XB* index is more effective than the XB index, because with XB* one can detect clusters with large compactness values. With this information one can determine the optimum number of clusters by observing these ambiguities in the clustering structure. Hence, XB* is the starting point of our new cluster validity formula for the IFC algorithm.
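For concreteness, a small sketch of the XB and XB* computations of (3.39)-(3.40) follows. It assumes a membership matrix U of shape (n, c), data X of shape (n, nv), and centers V of shape (c, nv), and is illustrative rather than a reference implementation.

import numpy as np

def xb_indices(U, X, V):
    n, c = U.shape
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (n, c)
    per_cluster = (U ** 2 * d2).sum(axis=0) / n               # compactness terms
    # Minimum squared distance between distinct cluster centers.
    vdist2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    sep = vdist2[~np.eye(c, dtype=bool)].min()
    xb = per_cluster.sum() / sep          # eq. (3.39): averaged compactness
    xb_star = per_cluster.max() / sep     # eq. (3.40): maximum compactness
    return xb, xb_star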

On the other hand, Kung and Lin [2004] formulated a validity index to validate fuzzy c-regression (FCRM) [Hathaway and Bezdek, 1993] type clustering approaches, i.e., fuzzy adaptations of switching regression problems. FCRM clustering algorithms identify c regression model parameters and a partition matrix (membership value matrix), which is interpreted as the importance or weight attached to the error measured between the actual output and each regression model output. The validity measure in [Kung and Lin, 2004] is based on the XB index; however, the compactness is measured by the error between the observed output and the output obtained from a linear or polynomial regression function of each cluster. The separability is measured by the inverse dissimilarity between clusters, defined by the absolute value of the standard inner product of the unit normal vectors representing the c hyper-planes. Kung-Lin's cluster validity index is formulated as follows:

Kung\text{-}Lin(c) = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^2\left(x_k^T\beta_i - y_k\right)^2/\,n}{\left[\dfrac{1}{\max_{i\ne j}\left|\langle u_i,u_j\rangle\right| + \kappa}\right]}, \qquad \beta_i = \left[\mathbf{x}^T[\mu_i]\mathbf{x}\right]^{-1}\mathbf{x}^T[\mu_i]\mathbf{y} \qquad (3.41)

The numerator in (3.41) is a compactness measure and the denominator represents separability. The u_i represent the unit normal vectors of the c regression functions. FCRM models [Hathaway and Bezdek, 1993] are represented by regression equations, so their corresponding unit vectors are defined by

u_i = \frac{n_i}{\left\|n_i\right\|}, \qquad n_i = \left[\beta_{i,1},\dots,\beta_{i,nv},-1\right] \in \Re^{nv+1} \qquad (3.42)

n_i contains the regression function parameters, β_{i,nv}, in vector form, ||·|| is the Euclidean norm, and nv is the number of variables of the input dataset, x=[x_1,…,x_nv]. The inner product of the unit vectors of two clusters equals the cosine of the angle between them; this value is used to measure the separability of the c regression functions in Kung and Lin's [2004] validity formula. When the functions are orthogonal, separability is maximized. They have shown that their validity function is a good indicator for FCRM models.
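The separability term of (3.41)-(3.42) reduces to a cosine between hyper-plane normals. A brief sketch follows, illustrative only, with the regression coefficient matrix B of shape (c, nv) assumed as input.

import numpy as np

def max_cos_between_planes(B):
    # Append -1 for the output coordinate to form the normals n_i (eq. 3.42).
    N = np.hstack([B, -np.ones((B.shape[0], 1))])
    U = N / np.linalg.norm(N, axis=1, keepdims=True)   # unit normals
    cos = np.abs(U @ U.T)                              # |cos(theta_ij)|
    np.fill_diagonal(cos, -np.inf)                     # ignore i == j
    return cos.max()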


3.4.2 The New Cluster Validity Indices

In this work, two new ratio-type cluster validity indices, cvi^IFC and cvi^IFC-C, are presented to validate the novel Improved Fuzzy Clustering (IFC) [Celikyilmaz and Turksen, 2007b] and the IFC for classification problems, IFC-C [Celikyilmaz and Turksen, 2007i]. Firstly, we focus on IFC for regression problems. In the IFC algorithm, clusters are identified by cluster prototypes (centers) and their corresponding regression functions. The membership values calculated by the IFC algorithm represent the degree to which each object belongs to each prototype. They are also used as candidate input variables, which can help to identify relationships between the input and output variables through regression functions. Recall that each input vector is mapped onto a new feature space using the membership values and/or their user-defined transformations as additional inputs. A new dataset is formed for each cluster in this feature space, and then one "Fuzzy Function" is estimated per cluster using these datasets. It is therefore expected that the membership values obtained from the IFC algorithm can explain the output variable, in other words be "good" predictors in the regression functions, as well as represent better fuzzy partitions in the feature space.

When validating the IFC algorithm, one needs to relate the compactness and separability of the clusters by measuring the clustering structure as well as the relationships between the clusters' representative regression functions. The new validity measure should include both of these concepts when validating the number of clusters of IFC models. The compactness of the new validity measure combines two terms: (1) the XB* compactness (the numerator in (3.40)) is used as the first term, and (2) a modified version of the compactness of the Kung-Lin index (the numerator in (3.41)) as the second term. The second term of the compactness of cvi^IFC represents the error between the regression model and the actual output. The regression models are the "Fuzzy Functions", f(Φ_i,Ŵ_i)→y, where Φ_i is a matrix of the input variables together with the membership values of the corresponding cluster and their transformations, i.e., a mapping of the input space onto a new space using the membership values, and Ŵ_i are the regression coefficients. The separability of cvi^IFC, on the other hand, couples the angle between the regression functions with the distance between the cluster center prototypes.

In the new validity function, the original scalar inputs also enter as predictors of the functions when measuring the compactness. This can be explained as follows. The IFC algorithm finds membership values, μ_i^imp, i=1,…,c, that can predict the local models of a given system; (imp) indicates that the membership values come from the improved clustering (IFC). These membership values and/or their transformations are then used together with the original input variables to determine the "Fuzzy Functions" of each cluster, using a suitable function approximation method, e.g., least squares, support vector machines, ridge regression, etc., in system modeling with fuzzy functions (to be discussed in Chapter 4). The novel IFC algorithm introduces the membership values and their transformations as additional predictors alongside the original input variables in order to minimize the error of the local models of each cluster. For this reason, the optimum number of clusters of the new IFC algorithm is validated by analyzing the behavior of the new membership values, in addition to the original input variables, in the regression functions. The following steps describe the configuration of the new validity analysis:

(i) A different dataset is structured for each cluster i by using the membership values (μ_i^imp) and/or their transformations as additional predictors. This is the same as mapping the original input space, x∈ℜ^nv of nv input variables, onto a higher dimensional feature space ℜ^(nv+nm), i.e., x→Φ_i(x,μ_i^imp)∈ℜ^(nv+nm), for each cluster i, i=1,…,c. Hence, each data vector is represented in an (nv+nm)-dimensional feature space, where nm is the number of membership value columns and their potential transformations appended to the original input space. Below, a special feature space is shown, formed by mapping an original input matrix of one-dimensional inputs onto a new space of (nv+nm) dimensions plus a single output, using only the membership values:

\Phi_i(x,\mu_i^{imp}) = \begin{bmatrix} x_1 & \mu_{i1}^{imp} \\ \vdots & \vdots \\ x_n & \mu_{in}^{imp} \end{bmatrix}_{n\times 2}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}_{n\times 1} \qquad (nv=1,\ nm=1)

If needed, we use mathematical transformations of the membership values, such as (μ_ik^imp)², (μ_ik^imp)^m, exp(μ_ik^imp), ln((1-μ_ik^imp)/μ_ik^imp), etc., where m represents the degree of fuzziness of IFC.

(ii) We then fit a regression function for each cluster using the corresponding input datasets, Φ_i(x,μ_i^imp). The new cluster validity index designed for the IFC clustering is therefore formulated as:

cvi^{IFC}(c) = \frac{vc^*}{vs^* + 1},

vc^* = \max_{i=1,\dots,c}\left\{\frac{1}{n}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^m\left[\left\|(x_k,y_k)-\upsilon_i\right\|^2 + \left(y_k - f_i(\Phi_i,\hat{W}_i)\right)^2\right]\right\},

vs^* = \min_{i,j=1,\dots,c,\ j\ne i}\begin{cases}\dfrac{\left\|\upsilon_i-\upsilon_j\right\|^2}{\left|\langle\alpha_i,\alpha_j\rangle\right|}, & \text{if } \langle\alpha_i,\alpha_j\rangle\ne 0,\\[4pt] \left\|\upsilon_i-\upsilon_j\right\|^2, & \text{otherwise,}\end{cases} \qquad (3.43)

where vc^* represents the compactness and vs^* the separability of the new validity measure. Let n_{Φ_i} = [Ŵ_{i1}, Ŵ_{i2}, …, Ŵ_{i,nm}, Ŵ_{i,(nm+1)}, …, Ŵ_{i,(nm+nv)}] ∈ ℜ^(nv+nm) represent the normal vector, i.e., the vector orthogonal to the hyper-plane, of the fuzzy function obtained from the given dataset in the feature space, Φ_i(x,μ_i^imp)∈ℜ^(nv+nm). The α_i in |⟨α_i,α_j⟩|∈[0,1] represents the unit normal vector of each "Fuzzy Function" i, α_i = n_{Φ_i}/||n_{Φ_i}||, where ||n_{Φ_i}|| is the length of the vector. The absolute value of the inner product of the unit vectors of the fuzzy functions of two clusters, i,j=1,…,c, i≠j, equals the cosine of the angle between them:

\cos\theta_{i,j} = \left|\langle\alpha_i,\alpha_j\rangle\right| = \frac{\left|\langle n_{\Phi_i},n_{\Phi_j}\rangle\right|}{\left\|n_{\Phi_i}\right\|\left\|n_{\Phi_j}\right\|} = \frac{\left|\hat{W}_{i1}\hat{W}_{j1} + \dots + \hat{W}_{i(nm+nv)}\hat{W}_{j(nm+nv)}\right|}{\sqrt{\hat{W}_{i1}^2+\dots+\hat{W}_{i(nm+nv)}^2}\cdot\sqrt{\hat{W}_{j1}^2+\dots+\hat{W}_{j(nm+nv)}^2}} \qquad (3.44)

When two cluster centers are too close to each other due to an oversized number of clusters, the distance between them becomes almost zero (≅0) and the validity measure goes to infinity. To prevent this, the denominator of cvi^IFC in (3.43) is increased by 1. In cvi^IFC-C for IFC-C models, the "Fuzzy Functions" in vc^*, f_i(Φ_i(x,μ_i^imp),Ŵ_i), are replaced with the posterior probabilities obtained from the classifier models as follows:

vc^* = \max_{i=1,\dots,c}\left\{\frac{1}{n}\sum_{k=1}^{n}\left(\mu_{ik}^{imp}\right)^m\left[\left\|(x_k,y_k)-\upsilon_i\right\|^2 + \left(y_k - \hat{P}_{ik}(y_k=1\,|\,f_i(\Phi_i(x,\mu_i^{imp}),\hat{W}_i))\right)^2\right]\right\} \qquad (3.45)

The compactness measure, vc^* (the numerator of cvi^IFC), differs from the XB* compactness criterion in that it includes an additional term: the squared error of the Fuzzy Functions. Nevertheless, vc^* still affects the outcome in the same manner as the compactness of the XB* index, shown in Figure 3.6, for the following reasons. When c>c-optimum, the compactness of the clusters will be small. This is because, as the number of clusters is increased, the within-cluster distances decrease, since clusters contain increasingly similar objects. In addition, one regression function is estimated per cluster; as the number of functions increases, the error of the "Fuzzy Functions" decreases, because the regression model output approaches the actual output. Compactness will be zero when c=n, i.e., when every object becomes its own cluster center and a function passes through every object. On the other hand, when c<c-optimum, clusters will group dissimilar objects together, which increases the first term of the compactness in cvi^IFC. Since there will be fewer functions than the actual number of local models, the deviation between the actual and estimated outputs will be high. Therefore, if the number of clusters is less than the optimum, compactness will be high, and it decreases as the number of clusters converges to c-optimum. As c increases from c=c-optimum to c=n, the compactness gradually converges to zero.

As discussed earlier, each cluster may have a different compactness value. The new validity index, cvi^IFC, as shown in (3.43), shares the same concern as XB* [Kim and Ramakrishna, 2005]: averaging the compactness values of individual clusters could suppress the effect of clusters with high compactness when the number of clusters is small. When c<c-optimum, the difference between the maximum and minimum compactness of the clusters will be large, whereas when c≥c-optimum, the compactness of each individual cluster will be very small, so that the difference between the maximum and minimum compactness is negligible. Hence, we can identify sudden changes by analyzing the maximum compactness values instead of the average compactness values. These are assumptions under optimum conditions, and the new cluster validity index combines them with a measure of the separability of the clusters to increase the precision of the approximation of the optimum number of clusters.

The separability part of the new validity index likewise combines separability measures obtained from two different structures, i.e., regression and clustering. Between-cluster distances are represented as Euclidean distances between cluster centers in (3.43). Additionally, the absolute value of the cosine of the angle between each pair of "Fuzzy Functions", |⟨αi,αj⟩|∈[0,1], is used as an additional separability criterion. If two functions are orthogonal, then they are the most dissimilar functions, which is an attribute of an optimum model. The separability, i.e., the denominator of cviIFC in (3.43), conditionally combines between-cluster distances and angles by taking the ratio between them. If the angle between any two functions is zero, then they are parallel to each other, so we use the minimum distance between their cluster centers to represent the separability. Better separability then yields better clustering results.
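This conditional combination can be sketched as below. Since (3.43) itself is stated in the preceding section, the plain distance-to-|cos| ratio used here is our simplification of the rule described in the text, not the exact index.

```python
import numpy as np

def min_separability(V, W, eps=1e-12):
    """Sketch of the cviIFC separability: for every cluster pair, relate
    the distance between centers to |cos| of the angle between the
    clusters' fuzzy functions (the plain ratio is an assumption; (3.43)
    combines the two conditionally). Near-orthogonal functions
    (|cos| -> 0) make a pair look well separated even when the cluster
    centers are close."""
    c = V.shape[0]
    pairs = []
    for i in range(c):
        for j in range(i + 1, c):
            dist = np.linalg.norm(V[i] - V[j])
            cos = abs(W[i] @ W[j]) / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))
            pairs.append(dist / max(cos, eps))
    return min(pairs)

# cviIFC then divides the compactness by (1 + separability); the +1 keeps
# the index finite when two cluster centers nearly coincide, as noted above.
```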

Figure 3.7 is used to explain the changes in separability arising from different patterns in the dataset. When two clusters are far apart, i.e., the distance between their cluster centers is larger than for the rest of the clusters, e.g., cluster 1 (Δ's) and cluster 2 (•'s) in Figure 3.7, their separability will be large no matter what the angle between their regression functions is. Separability gets larger still if the two vector lines are orthogonal, i.e., the cosine of the angle approaches zero as the angle approaches 90°. On the other hand, when clusters are very close to each other, e.g., cluster 2 (•'s), cluster 3 (*'s) and cluster 4 (o's) in Figure 3.7, the angle between their functions becomes the dominant separability identifier. If the vectors are close to orthogonal, i.e., the cosine of the angle between them is very close to 0, as for cluster 3 and cluster 4, then the minimum separability will be very large even if the distance between their centers is close to zero.

Fig. 3.7 Four Different Functions with Varying Separability Relations


When two cluster centers are close to each other and the functions are almost parallel, i.e., the value of the cosine of the angle is close to 1, e.g., cluster 2 (•'s) and cluster 3 (*'s), then the separability will be small.

Using the artificially created datasets, to be presented next, we will show that the presented cviIFC and cviIFC-C validity measures are good indicators for the new clustering schemas, i.e., IFC and IFC-C, respectively. The performance of the new validity measures will be investigated in comparison with the three other well-known validity criteria mentioned above.

3.4.3 Simulation Experiments [Celikyilmaz and Turksen, 2007i;2008c]

In this section we present simulation experiments to measure the performance of the cviIFC and cviIFC-C methods by applying the IFC and IFC-C methods to artificially created datasets of different structures as well as to real datasets. We also apply three different well-known cluster validity indices to justify the strength of the new cluster validity measures. We then discuss the results obtained from the validity experiments.

Experiment 1

In order to demonstrate the performance of the new validity indices, we constructed tests on datasets with known structures (number of clusters, patterns of functions, etc.). We introduce a dataset structure similar to [Kung and Lin, 2004], but we formed 4 different datasets containing different numbers of linear models. Each dataset is created using a different number of functions with Gaussian noise, ε_l,m (l=1,…,4, m=1,…,nf), having zero mean and variance 0.9 in each function. m indexes the linear functions, such as those in Table 3.5, used to generate dataset l, l=1,…,4.

Firstly, 400 training input vectors, x, uniformly distributed in the range [-5, +5], are generated randomly to form the first dataset. The dataset is then split into 4 separate parts of 100 observations each. Each function from Table 3.5(A) is applied to one group to obtain the output values; e.g., the 4-clustered dataset is formed using four separate local models. Following the same convention, we generated 500, 490 and 450 more training vectors, x, separately, also uniformly distributed in the range [-5, +5], to generate the datasets with 5, 7 and 9 patterns using the corresponding functions in Table 3.5 (B), (C), and (D), respectively. Using the 500, 490 and 450 training vector sets, we applied each group of 100, 70 and 50 observations to the 5, 7, and 9 functions, correspondingly, to form 3 more single-input single-output datasets, i.e., dataset2, dataset3, and dataset4.
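A minimal sketch of the dataset1 construction follows; the slope and intercept values are hypothetical placeholders, since the exact Table 3.5(A) coefficients are not repeated here.

```python
import numpy as np

rng = np.random.default_rng(0)

# 400 uniform inputs on [-5, 5], split into 4 groups of 100; each group
# goes through its own linear model plus zero-mean Gaussian noise with
# variance 0.9. The beta values below are hypothetical stand-ins.
betas = [(2.0, 5.0), (0.2, -5.0), (-1.0, 1.0), (8.0, 0.0)]  # (slope, intercept)
x = rng.uniform(-5.0, 5.0, 400)
y = np.empty_like(x)
for m, (a, b) in enumerate(betas):
    g = slice(100 * m, 100 * (m + 1))
    y[g] = a * x[g] + b + rng.normal(0.0, np.sqrt(0.9), 100)
```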

We applied the IFC algorithm [Celikyilmaz and Turksen, 2007b] to these 4 datasets separately, using 2 different degree-of-fuzziness values, m=1.3 and m=2.0, and 14 different numbers of clusters, c=2,…,15. We measured the new cluster validity index, cviIFC, as well as the XB, XB*, and Kung-Lin validity measures, using the membership values obtained from these IFC models.


Table 3.5 Functions used to generate Artificial Datasets

(A) 4-cluster (dataset1): y_m = β_m^T x + ε_1,m, m = 1,…,4
(B) 5-cluster (dataset2): y_m = β_m^T x + ε_2,m, m = 1,…,5
(C) 7-cluster (dataset3): y_m = β_m^T x + ε_3,m, m = 1,…,7
(D) 9-cluster (dataset4): y_m = β_m^T x + ε_4,m, m = 1,…,9

Figure 3.8 illustrates the values obtained from the four cluster validity functions using the membership values of the IFC models on dataset1 for changing numbers of clusters, c, and two fuzziness values, m=1.3 and m=2.0. m=1.3 represents a crisper model in which the overlap of clusters is negligible, while m=2.0 is a model in which the local fuzzy clusters (models) overlap to a degree of 2.0. Analogously, Figure 3.9, Figure 3.10 and Figure 3.11 compare the values obtained from these four cluster validity measures using the membership values of the IFC models on dataset2, dataset3 and dataset4, respectively. Hence, it is expected that c* would be 4, 5, 7, and 9 for dataset1, dataset2, dataset3 and dataset4, respectively.

Fig. 3.8 Dataset 1 - Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-) using the 4-patterned dataset


Fig. 3.9 Dataset 2 - Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-) using the 5-patterned dataset

Fig. 3.10 Dataset 3 - Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-) using the 7-patterned dataset

Experiment 2

In order to demonstrate the strength of the proposed validity measure on real datasets, we applied the proposed IFC to a dataset of historical stock prices of a major Canadian financial institution. The aim in this experiment is to predict the last two months' stock prices using estimated models obtained from the previous 10 months' stock prices.


Fig. 3.11 Dataset 4 - Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-) using the 9-patterned dataset

Fig. 3.12 Density graph of the stock price dataset using two financial indicators, the 50-day EMA and the 20-day Bollinger Band. Two well-separated components, potential cluster #1 and potential cluster #2, are indicated.

In real datasets, just as in artificial datasets, there are hidden components, viz. clusters, within which input-output relations can be locally identified. Since we do not have a prior conception of the actual number of clusters of real datasets, we can conduct an exhaustive search by changing the number of clusters while keeping the other parameters constant and investigating the performance of each model.


The optimum model would be the one with the best performance, e.g., the least model error. We identify the optimum number of clusters as the number of clusters of the optimum model. Here, we used the proposed fuzzy system modeling tools based on type-1 Improved Fuzzy Functions (T1IFF), to be discussed in chapter 4. The T1IFF system modeling tools implement the novel IFC method to find hidden structures in a given dataset. Improved membership values are obtained from IFC and used as additional predictors to identify the local fuzzy functions. We iterated T1IFF for different values of the number of clusters, keeping the rest of the parameters constant, and captured the optimum number of clusters, c*, based on the minimum-error criterion.
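The search pattern itself is a plain loop over c. In the sketch below, a toy k-means-plus-local-regression pipeline stands in for T1IFF (which is the subject of chapter 4), so only the exhaustive-search protocol, not the model, mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, c, iters=50):
    """Tiny k-means stand-in for the clustering step (not IFC itself)."""
    V = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1).argmin(1)
        V = np.array([X[labels == i].mean(0) if (labels == i).any() else V[i]
                      for i in range(c)])
    return V, labels

def holdout_rmse(Xtr, ytr, Xte, yte, c):
    """One linear model per cluster; score on the holdout set."""
    V, labels = kmeans(Xtr, c)
    coefs = []
    for i in range(c):
        idx = labels == i
        if idx.any():
            A = np.c_[Xtr[idx], np.ones(idx.sum())]   # design matrix with bias
            coefs.append(np.linalg.lstsq(A, ytr[idx], rcond=None)[0])
        else:
            coefs.append(np.zeros(Xtr.shape[1] + 1))  # empty-cluster guard
    te_labels = ((Xte[:, None, :] - V[None, :, :]) ** 2).sum(-1).argmin(1)
    preds = np.array([np.r_[x, 1.0] @ coefs[i] for x, i in zip(Xte, te_labels)])
    return np.sqrt(((yte - preds) ** 2).mean())

# Toy data with two linear regimes (invented), split into train/holdout:
X = rng.uniform(-5, 5, size=(400, 1))
y = np.where(X[:, 0] > 0, 2.0, -3.0) * X[:, 0] + rng.normal(0, 0.5, size=400)
scores = {c: holdout_rmse(X[:320], y[:320], X[320:], y[320:], c)
          for c in range(2, 11)}        # the text's bound c <= n/10 would allow 32
c_star = min(scores, key=scores.get)    # optimum c by minimum holdout RMSE
print(c_star, scores[c_star])
```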

At this point we should pause and explain the multi-modal structure of real systems. One can observe hidden structures in a given system just by analysing the probability distributions of the input variables. For instance, in Figure 3.12 the Gaussian density graphs of the given stock price dataset are illustrated using two financial indicators. Two separate components can easily be noticed in the graph. The degree of overlap between these components might affect the number of clusters perceived by a human operator. We expect to distinguish any hidden overlapping clusters in real datasets using the cviIFC, based on the structure of the system.
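A quick way to reproduce this kind of inspection is to plot the density of a single indicator; the data below are a synthetic stand-in with two regimes, not the stock data of Figure 3.12.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented indicator values drawn from two regimes, mimicking the
# two-component shape seen in Figure 3.12.
rng = np.random.default_rng(1)
indicator = np.concatenate([rng.normal(48, 2, 300),   # potential cluster #1
                            rng.normal(60, 3, 300)])  # potential cluster #2
plt.hist(indicator, bins=40, density=True)
plt.xlabel("indicator value")
plt.ylabel("density")
plt.show()  # two separated bumps suggest (at least) two hidden components
```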

In this experiment, stock prices collected over 12 months are divided into two parts. Indicator values from the first ten months, i.e., from 27 July 2005 to 11 May 2006, are used to train the models and to optimize the model parameters. The last two months, i.e., from 11 May to 21 July 2006, are held out for testing the models' performances. The experiments were repeated with 20 random subsets of the above sizes. Model performances on the holdout datasets are measured using the root-mean-square error (RMSE) and averaged over the 20 repetitions. Financial indicators such as the moving average and the exponential moving average are used as predictors. Detailed explanations of the financial indicators used in such experiments are given in the chapter on experiments.

We applied an exhaustive search method to identify the optimum T1IFF model by changing the number of clusters, bounded by c ≤ n/10, where n is the number of training samples. The improved membership values, μ^imp, their logit-type transformations, i.e., log((1-μ^imp)/μ^imp), and their exponential transformations, i.e., exp(μ^imp), are used as additional dimensions (input variables) to approximate the fuzzy functions in a new feature space. Based on the minimum RMSE values, the average optimum c, c*, of the best T1IFF models over the 20 repetitions is identified as c*=4.9±1.77. In short, we found that this stock price dataset may consist of c*∈[3,6] structures. We now want to test whether the new cluster validity index could have found the optimum c* by just applying IFC clustering, without executing the exhaustive search over the T1IFF strategy.
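The feature augmentation described here is mechanical; below is a minimal sketch, using the transformation formulas quoted above and assuming the improved memberships arrive as a vector aligned with the data rows (the helper name is ours).

```python
import numpy as np

def augment_with_memberships(X, mu_imp, eps=1e-9):
    """Append the improved membership values and their transformations,
    log((1 - mu)/mu) and exp(mu) as quoted in the text, as extra input
    dimensions for approximating a cluster's fuzzy function."""
    mu = np.clip(mu_imp, eps, 1.0 - eps)   # keep the logarithm finite
    return np.column_stack([X, mu, np.log((1.0 - mu) / mu), np.exp(mu)])

X = np.random.default_rng(2).uniform(-5, 5, (6, 2))
mu = np.array([0.9, 0.7, 0.2, 0.5, 0.8, 0.1])
print(augment_with_memberships(X, mu).shape)   # (6, 5): 2 inputs + 3 new dims
```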

To validate the latter statement, we calculated the values of the proposed validity index, cviIFC, as well as the XB, XB* and Kung-Lin indices, using membership values obtained from the proposed IFC models. We plotted the validity index values measured for different c values and two fuzziness values, m=1.3 and 2.0.


Fig. 3.13 Stock Price Dataset - Cluster validity measures, XB, XB*, Kung-Lin and cviIFC, versus c for two m values; m=1.3 (.-) and m=2.0 (*-)

The CVI graphs in Figure 3.13 indicate the values of the different CVI measures using membership values from the IFC algorithm applied to the stock price dataset.

Experiment 3

In addition to the first two experiments above, we wanted to validate the performance of the novel Improved Fuzzy Clustering for classification (IFC-C) algorithm using the new cviIFC-C method. Since the proposed IFC-C is specifically designed for binary classification problems, we chose the Ionosphere classification dataset from the UCI repository [Newman et al., 1998] to demonstrate the performance of the new cviIFC-C. The targets of the ionosphere dataset take on the values "good" or "bad", indicating the free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those that do not, i.e., their signals pass through the ionosphere. The details of this dataset are given in the chapter on experiments (Chapter 6).

In this experiment, again, the actual number of clusters that the Ionosphere dataset might hold, viz. the overlapping or discrete structures that represent multiple classification surfaces of the ionosphere dataset, is an unknown parameter. Therefore we applied our proposed Improved Fuzzy Functions approach for classification problems (T1IFF-C, to be discussed in Chapter 4) to the ionosphere dataset. We only changed the number of clusters, keeping the rest of the parameters constant.


The list of parameters of the T1IFF models of experiment 2, used to model the stock prices, is also used to build the different T1IFF-C models of this experiment. The experiment is repeated 10 times. The results yielded an average c* of around c*=3.5±0.92, viz. c*∈[3,5], based on the highest average classification accuracy over the 10 iterations. To validate the optimum number of clusters, c*, we applied the proposed cviIFC-C as well as the XB, XB* and Kung-Lin indices to the outcomes of the 10 different IFC-C models and took the average of the values of each individual validity measure. Figure 3.14 demonstrates the results for two fuzziness values, m=1.3 and 2.0.

Fig. 3.14 Ionosphere Dataset - Cluster validity measures, XB, XB*, Kung-Lin and cviIFC-C, versus c for two m values; m=1.3 (.-) and m=2.0 (*-)

3.4.4 Discussions on Performances of New Cluster Validity Indices Using Simulation Experiments

We have reached the following conclusions based on the analysis of the results of the experiments on 4 different artificial datasets and a real-life dataset of a prediction nature (the stock price dataset). We analyzed the outcome of IFC to measure the values of cviIFC along with the values of the known cluster validity measures, XB, XB*, and Kung-Lin. The discussion also includes the analysis of IFC-C models on a binary classification dataset, i.e., the Ionosphere dataset. For these experiments we measured cviIFC-C as well as XB, XB*, and Kung-Lin to validate the IFC-C model results.

Analysis of Experiment 1. Table 3.6 shows the optimum numbers of clusters predicted by the four different validity measures on the 4 artificial datasets for two fuzziness levels, i.e., m=1.3 and m=2.0.


Table 3.6 Optimum number of clusters of artificial datasets for m∈{1.3, 2.0}

| m | Index | Dataset 1 (c*=4) | Dataset 2 (c*=5) | Dataset 3 (c*=7) | Dataset 4 (c*=9) |
|---|---|---|---|---|---|
| 1.3 | XB | 2-6 | 2-6, 11 | 5 | 2, 8 |
| 1.3 | XB* | 2-9 | 2-7, 11 | 10 | 4 |
| 1.3 | Kung-Lin | 9 | 10 | 8 | 9, 10 |
| 1.3 | Proposed cviIFC | 4, 5 | 4-5 | 7 | 9 |
| 2.0 | XB | 2-6 | 2-6, 11 | 3, 5, 6 | 4, 7, 9 |
| 2.0 | XB* | 2-8 | 2-9 | 8 | 9-10 |
| 2.0 | Kung-Lin | 9, 13 | 10 | 6 | 9, 10 |
| 2.0 | Proposed cviIFC | 4 | 5, 6 | 6, 7 | 9 |

The results shown in Table 3.6, obtained from the application of the proposed cviIFC to the 4 different datasets (Figure 3.8 - Figure 3.11), indicate that cviIFC can successfully suggest the optimum number of clusters. It is also not affected by changing the level of fuzziness, i.e., m=1.3 versus m=2.0. It converges at the optimum number of clusters and then asymptotes to zero as the number of clusters grows large, e.g., towards c=15 in this case. In all four datasets, the new validity criterion captures c* at the actual value and then converges to zero for c>c*.

The difference between cviIFC and XB* is that the proposed cviIFC is expected to show asymptotic behaviour towards larger numbers of clusters, whereas the XB* validity index can increase or decrease, with c* located where the index attains its minimum value. When the actual number of clusters is small, as in dataset1 and dataset2, XB* cannot directly identify c* for either level of fuzziness. For larger c* values, e.g., dataset3 in Figure 3.10, XB* can identify the actual number of clusters more precisely. We can conclude that when the number of different components (clusters, different patterns) in a given dataset is large, XB* can validate the IFC clustering applications to a certain degree. When the system has fewer models, XB* is not the optimum measure for validating the number of clusters of the IFC method.

Since the Kung-Lin [Kung and Lin, 2004] cluster validity index asymptotes to zero, the knee point is the indicator for the optimum number of clusters, c*. From the cluster validity graphs of the four artificial datasets shown in Figure 3.8 - Figure 3.11, the Kung-Lin index is unable to identify the optimum number of clusters for datasets with few clusters and is capable of identifying c* for datasets with larger numbers of clusters only to a certain extent.
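Reading the knee off such a curve can be automated; the heuristic below is our helper (not part of the Kung-Lin index) and picks the point farthest from the chord joining the curve's endpoints.

```python
import numpy as np

def knee_point(cs, vals):
    """Heuristic knee detector: returns the c whose point on the
    validity curve lies farthest from the straight line between the
    curve's first and last points."""
    cs = np.asarray(cs, dtype=float)
    vals = np.asarray(vals, dtype=float)
    dx, dy = cs[-1] - cs[0], vals[-1] - vals[0]
    # perpendicular distance of each point from the endpoint chord
    dist = np.abs(dx * (vals - vals[0]) - dy * (cs - cs[0])) / np.hypot(dx, dy)
    return int(cs[dist.argmax()])

cs = np.arange(2, 16)
vals = 1.0 / cs**2 + 0.01 * cs   # hypothetical index that asymptotes
print(knee_point(cs, vals))
```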

In [Kim and Ramakrishna, 2005], it was shown that the XB index is inefficient at identifying the optimum number of clusters, c*, of FCM clustering models. We wanted to test whether it can identify the c* of the IFC models.


The minimum XB values indicate the predicted c*. The XB index was unable to identify an exact c* in most of the datasets. It can, however, give a wider range that includes the optimum c*, but this information is not enough to determine a precise value for the optimum number of clusters.

Analysis of Experiment 2. The optimum number of clusters was estimated to be c*∈[3,6] based on the models of the T1IFF fuzzy system, where the proposed IFC is used for structure identification on the real stock price dataset [Celikyilmaz and Turksen, 2007b]. Hence, we applied the proposed cviIFC function and averaged over the 20 repetitions. Figure 3.13 shows the values obtained from the four different validity measures for the two levels of fuzziness. The results are summarized in Table 3.7. The values of the proposed cviIFC indicate that c* should be around 4 or 6 for the two different m values. The Kung-Lin index can also correctly validate c*. In addition, the XB index can somewhat identify c*, within the interval c*∈[4,8]. Among the 4 different cluster validity measures, cviIFC still has the closest estimates of the actual c*.

Table 3.7 Optimum number of clusters of IFC models of the stock price dataset identified by different validity indices

| m | Index | Predicted c* (actual c*∈[3,6]) |
|---|---|---|
| 1.3 | XB | 6-8 |
| 1.3 | XB* | 4, 6-8 |
| 1.3 | Kung-Lin | 4 |
| 1.3 | Proposed cviIFC | 4, 6 |
| 2.0 | XB | 5, 7 |
| 2.0 | XB* | 5, 7 |
| 2.0 | Kung-Lin | 4 |
| 2.0 | Proposed cviIFC | 4, 6 |

Analysis of Experiment 3. In experiment 3, we wanted to demonstrate that the new cviIFC-C, specifically designed for classification-type datasets, can actually identify the optimum number of clusters of a real dataset, viz. the ionosphere dataset from the UCI repository. From the exhaustive analysis based on T1IFF-C models for changing c values and two different levels of fuzziness, we obtained that c* should be c*∈[3,5]. From the validity graphs shown in Figure 3.14, we obtained the optimum number of clusters indicated by each validity measure for the two fuzziness values and listed them in Table 3.8. It is to be noted that cviIFC-C is able to identify overlapping clusters, and its elbow indicates that the optimum c* should be 4 or 6. The Kung-Lin index could also identify the approximate c*. None of the other cvi measures was able to confidently identify c* for changing values of fuzziness.


Table 3.8 Optimum number of clusters of IFC-C models of the Ionosphere dataset indicated by different validity indices

| m | Index | Predicted c* (actual c*∈[3,6]) |
|---|---|---|
| 1.3 | XB | 6-9 |
| 1.3 | XB* | 8-9 |
| 1.3 | Kung-Lin | 6 |
| 1.3 | Proposed cviIFC-C | 6 |
| 2.0 | XB | 4, 7, 8 |
| 2.0 | XB* | 7, 8 |
| 2.0 | Kung-Lin | 6 |
| 2.0 | Proposed cviIFC-C | 4 |

3.5 Summary

In data mining practice, one of the challenges encountered is that the datasets under study may have different properties, e.g., number of variables, number of data vectors, noise level, number of different model structures, characteristics of the dependent variable, etc., so that different approaches should be followed to obtain better results from the representative models of a system. Based on these characteristics of datasets, different fuzzy clustering algorithms have been introduced in the literature to find the hidden structures in them. If one designs a fuzzy clustering algorithm for an image processing dataset, which is composed of pixel intensities, the neighborhood pixel intensities might be one of the important measures that the clustering algorithm should utilize. Similarly, for a web log dataset, similarity based on recurring occurrences of corresponding words will be the most discriminative property. The fuzzy clustering should therefore include similarity measures suited to the type of dataset under study. In this chapter different fuzzy clustering methods were reviewed and the structural differences between them discussed.

Most importantly, in this chapter a new improved clustering algorithm was proposed for use in a new fuzzy system modeling approach, namely the "Improved Fuzzy Functions" approaches. The logic behind the proposed improved fuzzy clustering algorithm is that feature spaces are related not only to the input space but also to the input-output mapping space, so the given data objects can be grouped by considering the local models of the dataset as well as the linear relationships between the input and output variables. Therefore, the new improved clustering algorithm incorporates regression-type clustering to find membership values and their transformations that help to explain the local model relationships and, at the same time, the basic clustering structure, which helps to separate datasets into meaningful clusters. We also presented an extension of the new fuzzy clustering method for classification problem domains.

This chapter also dealt with a problem common to almost all clustering algorithms, that is to say, the determination of the optimum number of clusters. Since there are numerous fuzzy clustering methods in the current literature, many different cluster validity measures have been proposed; in particular, there are numerous validity measures just to identify the optimum number of clusters of the well-known Fuzzy C-Means clustering algorithm. In this chapter two new cluster validity indices were introduced, corresponding to the two new fuzzy clustering methods, in order to validate their structures. Experimental analysis on artificial and real-life datasets showed that the new cluster validity indices can help to identify the optimum number of clusters of the models of the new improved fuzzy clustering methods.

We preferred to present the experimental analysis of the new validity indices in this chapter instead of in the chapter on experiments. The main reason is that in chapters 4 and 5 a new evolutionary system modeling structure based on the proposed fuzzy functions approaches will be introduced to dynamically optimize the number of clusters based on stochastic search. Hence, the approaches proposed in Chapter 4 and Chapter 5 will be used to optimize the number of clusters with genetic algorithms.