
ELSEVIER Fuzzy Sets and Systems 93 (1998) 381-383

A note on "FUZMAR: An approach to aggregating market research data based on fuzzy reasoning" by R.R. Yager et al.

Malay Bhattacharyya

Indian Institute of Management, Prabandh Nagar, Off Sitapur Road, Lucknow 226 013, India

Received March 1996; revised July 1996

Abstract

This paper establishes a connection between the work of Yager et al. [Fuzzy Sets and Systems 68 (1994) 1-11] and known statistical modelling techniques. Based on this connection, the author suggests a new way for selecting an appropriate model in the framework of Yager et al., which has some advantages over the selection criterion of Yager et al. © 1998 Elsevier Science B.V.

Key words: Data analysis methods; Shannon entropy; deviance; chi-square distribution

1. Introduction

In their paper, Yager and others [3] have introduced an approach for validating models involving linguistic variables. There are two aspects to their work. First, they have provided a method using fuzzy reasoning for combining variables taking linguistic values in a linearly ordered scale. This leads to different alternative models. Second, they have provided a method based upon the principle of minimal entropy to choose one from amongst the competing models. At the end of their work, they have said that "... there is a need to develop a means to determine if the values of $H^{[k]}$ (the entropy values) are significantly different from each other". This paper addresses this last point mentioned by them. In this paper, it is shown that the Shannon entropy, as used by Yager et al., is a linear function of deviance, a measure of goodness of fit, originally introduced by Nelder and Wedderburn (see [2, Section 2.3]). As deviance follows approximately a chi-square distribution, it can be used more meaningfully and with statistical validity for selecting the "best" model in the framework of Yager et al.

2. The measure of deviance and its relationship with entropy

In Section 2 in [3], Yager et al. have essentially dealt with observations or data relating to two categorical variables. It is clear that the "overall structure" in Fig. 2 of Section 2 in [3] is the same as the most widely used classical structure of a two-way contingency table in statistics, with values of $V$ as the row categories and those of $U$ as the column categories. In terms of the contingency table, let us write $f_{ij}$ = the number of respondents corresponding to the $i$th row and the $j$th column, $C_j$ = the $j$th column sum, i.e., the total number of respondents in the $j$th column,

S0165-0114/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved PII S0165-0114(96)00226-6


$R_i$ = the $i$th row sum, i.e., the total number of respondents in the $i$th row, and $N$ = the total number of respondents.

In their notation, $f_{ij} = n_{ji}$, $C_j = n_j$, and $N = n$. It is easily seen that the overall uncertainty, or expected entropy, used by them is the same as

$$H = -\sum_j \frac{C_j}{N} \sum_i \frac{f_{ij}}{C_j} \ln\frac{f_{ij}}{C_j}. \tag{1}$$

Second, the higher the association or dependence between the two variables $U$ and $V$, the stronger the predictive power of knowledge of $U$ in predicting the values of $V$. Conversely, if $U$ and $V$ are independent, then knowledge of $U$ is of little use for predicting $V$. It is well known that one measure of association between two categorical variables or attributes is the Pearson chi-square. The other measure, which is not so well known, is the deviance introduced by Nelder and Wedderburn [2]. For a two-way contingency table the deviance is defined as follows (see Section 4 in [2]):

$$G^2 = 2 \sum_i \sum_j f_{ij} \ln\left(\frac{f_{ij}}{e_{ij}}\right), \tag{2}$$

where $e_{ij}$ is the expected frequency for the cell corresponding to the $i$th row and the $j$th column. The expected frequency is calculated as $e_{ij} = (R_i \times C_j)/N$, under the assumption that $U$ and $V$ are independent.
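As an illustration of the definition above, the following is a minimal Python sketch of the deviance for a two-way table of counts. The function name `deviance` and both example tables are hypothetical and are not taken from Yager et al. or from this note; cells with zero count are assumed to contribute nothing to the sum.

```python
import math

def deviance(table):
    """Return G^2 = 2 * sum_ij f_ij * ln(f_ij / e_ij) for a list of rows of counts."""
    row_sums = [sum(row) for row in table]            # R_i
    col_sums = [sum(col) for col in zip(*table)]      # C_j
    n = sum(row_sums)                                 # N
    total = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            if f > 0:  # a zero cell contributes 0 (limit of f * ln f)
                e = row_sums[i] * col_sums[j] / n     # expected count under independence
                total += f * math.log(f / e)
    return 2.0 * total

# Hypothetical counts: rows proportional to each other (independence) ...
independent = [[10, 20, 30], [20, 40, 60]]
# ... and rows with clearly different column profiles (association).
associated = [[30, 10, 5], [5, 10, 30]]

print(deviance(independent))  # essentially zero
print(deviance(associated))   # large positive value
```

When the observed counts equal the expected counts, every logarithm vanishes and the deviance is zero; it grows as the table departs from independence.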

By using the above expression for $e_{ij}$ and rearranging terms, it is easily seen that (2) can be rewritten as

$$G^2 = 2 \sum_i \sum_j f_{ij} \ln\left(\frac{f_{ij}}{C_j}\right) + 2 \sum_i R_i \ln\left(\frac{N}{R_i}\right). \tag{3}$$

Using (1) and (3), we have

$$G^2 = -2NH - 2 \sum_i R_i \ln\left(\frac{R_i}{N}\right). \tag{4}$$
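The intermediate steps can be written out as follows. This is a sketch that assumes the expected entropy has the form $H = -\sum_j (C_j/N) \sum_i (f_{ij}/C_j) \ln(f_{ij}/C_j)$, which is the form implied by the relation between the deviance and $H$; it uses only $e_{ij} = R_i C_j / N$ and $\sum_j f_{ij} = R_i$.

```latex
\begin{align*}
G^2 &= 2\sum_i\sum_j f_{ij}\ln\frac{f_{ij}}{e_{ij}}
     = 2\sum_i\sum_j f_{ij}\ln\frac{N f_{ij}}{R_i C_j} \\
    &= 2\sum_i\sum_j f_{ij}\ln\frac{f_{ij}}{C_j}
       - 2\sum_i\Bigl(\sum_j f_{ij}\Bigr)\ln\frac{R_i}{N} \\
    &= -2NH - 2\sum_i R_i\ln\frac{R_i}{N},
\end{align*}
% since NH = -\sum_i\sum_j f_{ij}\ln(f_{ij}/C_j) under the assumed form of H.
```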

Now, note that the second term in Eq. (4) is invariant with respect to the choice of the aggregation function that determines the values of the predictor variable $U$ (see Sections 3 and 5 in [3]). This is so because $R_i$, the total number of respondents selecting the $i$th value of the variable $V$, is the same for every $i$ whichever aggregation function is chosen, whereas the $C_j$ depend on the choice of the aggregation function that determines the values of the variable $U$. So, by letting

$$K = \sum_i R_i \ln\left(\frac{R_i}{N}\right), \tag{5}$$

we can write

$$G^2 = -2NH - 2K. \tag{6}$$

Thus we have shown that there is an inverse linear relationship between $G^2$, the deviance, and $H$, the entropy measure, as used by Yager et al.
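The linear relationship can be checked numerically. The sketch below is illustrative only: the function name and the two tables are hypothetical, and $H$ is assumed to be the expected entropy of $V$ given $U$ computed with natural logarithms. The second table has the same row sums as the first but different columns, mimicking a different aggregation function for $U$.

```python
import math

def entropy_k_deviance(table):
    """Return (N, H, K, G2) for a two-way table of counts (rows = V, columns = U)."""
    rows = [sum(r) for r in table]                # R_i
    cols = [sum(c) for c in zip(*table)]          # C_j
    n = sum(rows)                                 # N
    cells = [(i, j, f) for i, r in enumerate(table)
             for j, f in enumerate(r) if f > 0]   # skip empty cells
    # Assumed form of the expected entropy: H = -(1/N) * sum f_ij ln(f_ij / C_j)
    h = -sum(f * math.log(f / cols[j]) for _, j, f in cells) / n
    # K depends on the row sums only
    k = sum(r * math.log(r / n) for r in rows if r > 0)
    # Deviance with e_ij = R_i * C_j / N
    g2 = 2.0 * sum(f * math.log(f * n / (rows[i] * cols[j])) for i, j, f in cells)
    return n, h, k, g2

table_a = [[30, 10, 5], [5, 10, 30]]   # hypothetical counts
table_b = [[20, 25], [25, 20]]         # same row sums, different columns

n_a, h_a, k_a, g2_a = entropy_k_deviance(table_a)
n_b, h_b, k_b, g2_b = entropy_k_deviance(table_b)

print(g2_a, -2 * n_a * h_a - 2 * k_a)  # the two numbers agree up to rounding
print(k_a, k_b)                        # K is unchanged when only the columns change
```

Because $K$ is fixed by the row sums, ranking models by lowest $H$ and by highest $G^2$ gives the same ordering.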

3. Use of deviance for model selection

A low value of $G^2$ indicates low association between the variables $U$ and $V$. Obviously, the lower the association, the weaker the predictive power of $U$. So, loosely speaking, the model which has the highest $G^2$ value should be considered as the "best". The values of $G^2$ corresponding to the five different models considered by Yager et al. are shown in the third column of Table 1, along with the values of $H^{[k]}$ (see [3, Section 5]).

First, note that unlike the values of $H$, those of $G^2$ are quite far apart from each other and, therefore, their differences are more discernible and are less likely to create doubts regarding the superiority of one model over another.

Second, when the assumption of independence is true and the sample size is relatively large, $G^2$ is approximately distributed as chi-square with degrees of freedom (d.f.) given by

$$\text{d.f.} = (\text{number of rows} - 1) \times (\text{number of columns} - 1).$$

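Table 1. Comparison of $H$, $G^2$, and $\mathrm{Prob}[G^2 > G_c^2]$

| $k$ | $H^{[k]}$ | $G^2$ | $\mathrm{Prob}[G^2 > G_c^2]$ |
|---|---|---|---|
| 5 | 0.5149 | 1.8809 | 0.3905 |
| 4 | 0.5190 | 0.5537 | 0.7582 |
| 3 | 0.5207 | 0.0165 | 0.9918 |
| 2 | 0.5143 | 2.0389 | 0.3608 |
| 1 | 0.5086 | 3.8401 | 0.1466 |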


The hypothesis of independence is rejected if the calculated value of $G^2$ is greater than the critical value obtainable from the standard published table at a suitable significance level. At an arbitrarily chosen significance level of 0.05, the critical value is 5.9915 (the upper 5% point of the chi-square distribution with 2 d.f.). Thus, at a 5% level of significance, the hypothesis of independence between $U$ and $V$ is not rejected for any model. Hence, from a strict statistical point of view, none of the five models considered by Yager et al. is really suitable for predicting the values of $V$ from those of $U$. Statistically speaking, the model that has a $G^2$ value greater than the critical value at a suitable level of significance should be considered for prediction.

However, another way of tackling the problem is to calculate the probability that $G^2$ will be greater than the calculated sample value, under the hypothesis that the variables are independent. The lower the probability, the better the corresponding model for prediction. These probabilities for Yager's problem are shown in the fourth column of Table 1, for the five different models.
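The critical value and the tail probabilities can be reproduced with a short Python sketch. It assumes 2 degrees of freedom, which is an inference from the quoted critical value of 5.9915 rather than something stated explicitly; for 2 d.f. the chi-square upper-tail probability has the closed form $\exp(-x/2)$, so no statistics library is needed. The names below are illustrative.

```python
import math

def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variable with 2 d.f. (exact: exp(-x/2))."""
    return math.exp(-x / 2.0)

# G^2 values for models k = 5, ..., 1, as listed in Table 1.
g2_by_model = {5: 1.8809, 4: 0.5537, 3: 0.0165, 2: 2.0389, 1: 3.8401}

# Critical value at the 0.05 level: solve exp(-x/2) = 0.05 for x.
critical = -2.0 * math.log(0.05)

p_values = {k: chi2_upper_tail_2df(g2) for k, g2 in g2_by_model.items()}
rejected = [k for k, g2 in g2_by_model.items() if g2 > critical]
best = min(p_values, key=p_values.get)  # lowest tail probability

print(round(critical, 4))
for k in sorted(p_values, reverse=True):
    print(k, round(p_values[k], 4))
print("models rejecting independence:", rejected)
print("model with the lowest probability:", best)
```

Under this assumption no model exceeds the critical value, and the lowest probability belongs to the model with the highest $G^2$.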

4. Concluding remarks

In respect of model selection, all three criteria, viz. $H$, $G^2$, and $\mathrm{Prob}[G^2 > G_c^2]$, give the same conclusion. However, from the point of view of comparing the values and selecting the "best" model, both $G^2$ and $\mathrm{Prob}[G^2 > G_c^2]$ seem to perform better than the entropy criterion used by Yager et al.

Finally, and most importantly, the two new criteria suggested in this paper, namely $G^2$, the deviance, and $\mathrm{Prob}[G^2 > G_c^2]$, provide a statistically valid framework for selecting the "best" model.

References

[1] R.V. Hogg and E.A. Tanis, Probability and Statistical Inference (Macmillan, New York, 1989).

[2] J.A. Nelder and R.W.M. Wedderburn, Generalized linear models, J. Roy. Statist. Soc. Ser. A 135 (1972) 370-384.

[3] R.R. Yager, L.S. Goldstein and E. Mendels, FUZMAR: An approach to aggregating market research data based on fuzzy reasoning, Fuzzy Sets and Systems 68 (1994) 1-11.