  • Calculating the sample size of multivariate populations: Norms of representation

    CONSTANTINOS N. TSIANTIS

    Department of Energy Technology Technological Educational Institution of Athens Agiou Spyridonos Street, Aegaleo 12210 Athens

    GREECE [email protected]

    Abstract: - This article addresses the problem of representation of a mathematical space and treats the problem of sampling as a problem of representation. It distinguishes between population representation and statistical representation, and considers statistical representation as the product of population representation and a statistical (behavioral) factor. It produces the formula of representation of a population N (consisting of a number $m$ of classes with a given number of subjects per class), giving the sample size $n_{ath}$. It produces the formula of statistical representation $n_{bath}$ as the product of $n_{ath}$ and the statistical factor. Application of the first formula justifies the number of representatives at the Vouli of ancient Athens, while application of the second formula gives results similar to those in the statistical bibliography.

    Key-Words: - Representation, Sample, Athenian Norm, Statistical Factor, Allocation, Stratification, Minimum Variance.

    1 Introduction

    A mathematical formula giving the number of persons required to represent a community of citizens would be of high significance for the political and social sciences in a democratic society. No such formula has been known to date, even though one was probably used in determining the number of representatives for the parliament (Vouli) of ancient Athens. The calculation of sample size is one of the central issues of statistics and influences both the validity of research outcomes and the cost of research.

    Modern statistics has provided formulas and tables for determining the sample size required to make comparisons among population groups [1],[2],[3], using the concept of effect size and the assumption that the measurable characteristics of subjects are normally distributed. However, the effect size and the validity of the normality assumption are usually not known beforehand [4] and, hence, previous statistical data are required. The same requirement holds for the method of stratified sampling, which accompanies the principle of representation and the condition of minimum variance.

    During the last decades, advances in computing and software (nQuery, PASS, SAT, etc.) have provided a spectrum of numerical methods for the problem of sampling. However, this plethora of approaches has established a relativism of solutions and has made choosing among the alternatives a time-consuming process. This situation is not functional for the practitioner of statistics in the various scientific fields (social sciences, education, psychology, biostatistics, environmental sciences, physical sciences, etc.) and, in addition, it preserves an undesirable relativism in mathematical epistemology, which demands definite, closed-form solutions for its problems.

    This article is an endeavour to treat the problem of sample size as a problem of representation and to develop closed-form mathematical formulas which reduce relativism and restrict the number of assumptions underlying the calculation of sample size.

    AMERICAN CONFERENCE ON APPLIED MATHEMATICS (MATH '08), Harvard, Massachusetts, USA, March 24-26, 2008

    ISSN: 1790-5117 301 ISBN: 978-960-6766-47-3

  • 2 Problem

    2.1 The representation of a population

    A population of size $N$ consists of $m$ mutually exclusive and exhaustive classes of subjects, with $N_1$ subjects in class 1, $N_2$ subjects in class 2, ..., and $N_m$ subjects in class $m$. What is the size $n$ and the synthesis (composition) $n_1, n_2, \dots, n_m$ of a sample, drawn randomly from the population, such that the minimum valid representation of the population is achieved?

    2.2 The statistical representation

    A population of size $N$ consists of $m$ mutually exclusive and exhaustive classes of subjects, with $N_1$ subjects in class 1, $N_2$ subjects in class 2, ..., and $N_m$ subjects in class $m$. The subjects are measured in terms of some variable of interest $X$, and the statistical parameters for the population and its classes (i.e., means and SDs) are considered known. A sample of size $n$, consisting of $n_1, n_2, \dots, n_m$ subjects respectively per class, is drawn randomly from the population. What is the value of $n$ and its allocation ($n_1, n_2, \dots, n_m$) such that, on the basis of the available information (data), the minimum valid representation of the population is achieved?

    3 Problem Solution

    3.1 The principle of representation

    Statistical methodology has used various ideas and strategies to extract a sample from a population. A guiding principle for doing this is the principle of representation, suggested and developed by mentors in the field [5],[6]. Related to this principle are the principle of random sampling and the method of stratification. The principle of representation implies, in essence, the demand of similarity between the synthesis of a space of interest (population) and the synthesis of its representative space (sample). This similarity is expressed by the equality of the respective class proportions between the space of interest and the representative space: $w_i = \nu_i$ ($i = 1:m$), where $w_i$ and $\nu_i$ are the proportions of class $i$ in the population and in the sample. The same thing can be stated probabilistically as follows: the probability that an element of class-$i$ of the space is found in it equals the probability that an element of the representative class-$i$ is found in the representative space.

    3.2 Solution of the problem of population representation

    3.2.1 Notations-Definitions

    $N$: the total size of the population, consisting of $m$ classes of subjects

    $m$: the number of classes, the same in the population and the sample

    $N_1, N_2, \dots, N_m$: the number of subjects per population class

    $$N_1 + N_2 + \dots + N_m = N \quad \text{(Eq.1)}$$

    $w_1, w_2, \dots, w_m$: the proportion of each class in the population

    $$w_1 = \frac{N_1}{N},\quad w_2 = \frac{N_2}{N},\quad \dots,\quad w_m = \frac{N_m}{N} \quad \text{(Eq.2)}$$

    $$w_1 + w_2 + \dots + w_m = 1 \quad \text{(Eq.3)}$$

    $n$: the sample size (to be calculated)

    $n_1, n_2, \dots, n_m$: the number of subjects per class in the sample

    $$n_1 + n_2 + \dots + n_m = n \quad \text{(Eq.4)}$$

    $\nu_1, \nu_2, \dots, \nu_m$: the proportion of each class in the sample

    $$\nu_1 = \frac{n_1}{n},\quad \nu_2 = \frac{n_2}{n},\quad \dots,\quad \nu_m = \frac{n_m}{n} \quad \text{(Eq.5)}$$

    $$\nu_1 + \nu_2 + \dots + \nu_m = 1 \quad \text{(Eq.6)}$$

    3.2.2 Deriving the Athenian norm of representation ($n_i$ proportional to $N_i$)

    The solution of Problem 2.1 (find the sample size $n$ and its synthesis $n_1, n_2, \dots, n_m$ representing a population $N$ with synthesis $N_1, N_2, \dots, N_m$) is formulated as follows:

    If $n$ is the required size of the sample and $n_1$ the number of subjects included in the first sample class, then the proportion of subjects of class-1 in the sample is

    $$\nu_1 = \frac{n_1}{n} \quad \text{(Eq.7)}$$


  • The probability that a subject (of whatever class) from the population $N$ is represented in the sample $n$ is

    $$p = \frac{n}{N} \quad \text{(Eq.8)}$$

    The probability $p_{11}$ that a subject of class-1 from the population $N$ is represented in the sample $n$ equals the product

    $$p_{11} = P_{prob1}\, p \quad \text{(Eq.9)}$$

    where $P_{prob1}$ is the probability that a subject of class-1 is found in the population $N$, that is

    $$P_{prob1} = \frac{N_1}{N} = w_1 \quad \text{(Eq.10)}$$

    Thus, the probability $p_{11}$, through Eqs.8 and 10, becomes

    $$p_{11} = w_1 \frac{n}{N} \quad \text{(Eq.11)}$$

    By applying Eq.11 to the $n_1$ subjects of class-1, we obtain the probability $p_{n1}$ that $n_1$ subjects of class-1 of population $N$ are represented in the sample $n$ (no matter in which sample class), that is

    $$p_{n1} = n_1\, p_{11} = n_1 w_1 \frac{n}{N} \quad \text{(Eq.12)}$$

    But, according to the principle of representation, the above probability must be equal to the probability implied by the respective class proportion $\nu_1 = n_1/n$ (Eq.5). That is,

    $$n_1 w_1 \frac{n}{N} = \frac{n_1}{n} \quad \text{(Eq.13)}$$

    Therefore, for the subjects of class-1 (through Eq.13) we obtain

    $$n^2 w_1 = N \quad \text{(Eq.14)}$$

    We repeat the above process for all the classes $i = 1, 2, \dots, m$ of the population, which are mutually exclusive and exhaustive. Thus, for the classes consisting of $N_1, N_2, \dots, N_m$ subjects respectively (having proportions $w_1, w_2, \dots, w_m$) the following system of equations is formed:

    $$\begin{cases} n^2 w_1 = N \\ n^2 w_2 = N \\ \quad\vdots \\ n^2 w_m = N \end{cases} \quad \text{(Eq.15)}$$

    By multiplying the above equations side by side, we obtain the relation

    $$n^{2m}\, w_1 w_2 \cdots w_m = N^m \quad \text{(Eq.16)}$$

    From Eq.16 we then obtain directly the size $n$ of the sample:

    $$n = \sqrt{\frac{N}{\sqrt[m]{w_1 w_2 \cdots w_m}}} \quad \text{(Eq.17)}$$

    The above formula (Eq.17) indicates that the sample size for the case of population representation depends not only upon the size of the population $N$, but also upon the way that $N$ is distributed among the population classes, as indicated by the $w_i$'s (Eq.2). Since, as is explained below (see Application 3.2.3), Eq.17 justifies the number of representatives making up the Vouli (Parliament) of ancient Athens, it is called the Athenian norm of representation, and the resulting $n$ is denoted here by the symbol $n_{ath}$.

    The synthesis of the sample $n_1, n_2, \dots, n_m$ (the number of subjects per class in the sample) is then given by the equation:

    $$n_i = \frac{N_i}{N}\, n_{ath} = w_i\, n_{ath} = \nu_i\, n_{ath}, \quad i = 1, 2, \dots, m \quad \text{(Eq.18)}$$
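Eq.17 and Eq.18 can be evaluated in a few lines. The following is a minimal Python sketch, assuming Eq.17 reads $n_{ath} = \sqrt{N / \sqrt[m]{w_1 w_2 \cdots w_m}}$ (the square root of $N$ divided by the geometric mean of the class proportions). The function name `athenian_norm` and the two-class population are hypothetical and serve only as an illustration.

```python
import math

def athenian_norm(class_sizes):
    """Return (n_ath, per-class allocation) following Eq.17 and Eq.18.

    class_sizes: list of positive class counts N_1..N_m (hypothetical input).
    """
    if any(size <= 0 for size in class_sizes):
        raise ValueError("every class must contain at least one subject")
    N = sum(class_sizes)
    m = len(class_sizes)
    w = [size / N for size in class_sizes]            # class proportions, Eq.2
    # geometric mean of the proportions, computed in log space
    geo_mean_w = math.exp(sum(math.log(w_i) for w_i in w) / m)
    n_ath = math.sqrt(N / geo_mean_w)                 # Eq.17
    allocation = [w_i * n_ath for w_i in w]           # Eq.18
    return n_ath, allocation

# Hypothetical population: two classes of 500 subjects each (N = 1000).
n_ath, allocation = athenian_norm([500, 500])
print(round(n_ath, 2), [round(a, 2) for a in allocation])
```

For this symmetric case the geometric mean of the proportions is 0.5, so $n_{ath} = \sqrt{2000}$, split equally between the two classes. The more unequal the class proportions, the smaller their geometric mean and the larger the resulting sample.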

    3.2.3 Application 1

    The Athenian parliament (Vouli) was established by Solon in 594 B.C. and originally consisted of 400 men (one hundred men from each of the four tribes). Cleisthenes (508 B.C.) expanded the number of representatives to 500 (50 men from each of the 10 municipalities/demoi of Attica). Membership was restricted during that time to the top three of the original four property classes (the nobles/Pentacosiomedimnoi, the knights/Hippes and the farmers/Zeugitae, not the Thetes) and to male citizens over the age of thirty. According to Sinclair, the number of citizens in the city of ancient Athens (males, females and children) was estimated at 120000 around 480 B.C. and at 160000-170000 around 431 B.C. (beginning of the Peloponnesian war). The number of male citizens who had completed their 30th year of age and were permitted to participate in the Vouli was about 30000 in 480 B.C. and 40000 in 431 B.C. [7]. The combined percentage of the top two classes, according to Glotz [8], was about 6.0% of the male citizens ($w_1 + w_2 = 0.06$, with $w_2$ about 3%). The majority of citizens were small farmers (Zeugitae), whose percentage $w_3$ can be derived from the equation $w_3 = 1 - w_1 - w_2$.

    If we apply the method of population representation to the above three civic classes, by ranging $N$ from 30000 to 40000 and $w_1$ and $w_2$ from 0.0255 to 0.0325 (using some computer program), then an area of possible solutions is identified which for $w_3 = 0.94$ approaches the number of 500 representatives (465 for N=30000, 502 for N=35000 and 537 for N=40000).

    3.3 Solution for the problem of statistical representation

    3.3.1 Additional Notation

    $X$: the variable of interest

    $\mu$: the mean for the whole population

    $\mu_1, \mu_2, \dots, \mu_m$: the mean for each population class

    $\sigma$: the standard deviation for the whole population

    $\sigma_1, \sigma_2, \dots, \sigma_m$: the standard deviation for each population class

    $\bar{x}$: the mean for the whole sample

    $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_m$: the mean for each sample class

    $s$: the standard deviation for the whole sample

    $s_1, s_2, \dots, s_m$: the standard deviation for each sample class

    3.3.2 Deriving the norm of statistical representation: $n_i$ proportional to $N_i \sigma_i$ (Neyman allocation)

    The procedure followed for deriving the fundamental formula (Eq.17) can be repeated appropriately to derive the formula for the case of statistical representation. This representation incorporates, besides the population representation, the statistical factor, which expresses the measurable characteristics (behaviour) of subjects in relation to some variable of interest $X$. We select here the standard deviation (SD) as the statistical parameter representing these characteristics (behavioural data). To proceed with the solution, we consider $\sigma$ and $N$ as elements of two distinct independent subspaces, upon which we can work separately while demanding at the same time that their product be represented by the sample $n$. For this purpose we rewrite Eq.10 ($P_{prob1} = N_1/N$) and Eq.11 ($p_{11} = w_1 n/N$), taking into account that:

    1) The standard deviation $\sigma$ for a variable $X$ (like its variance $\sigma^2$) has two sources: i) the standard deviations $\sigma_1, \sigma_2, \dots, \sigma_m$ of subjects within each class, and ii) the standard deviations between classes.

    2) The proportion by which a SD unit from class-$i$ ($i = 1, 2, \dots, m$) contributes to the sum of SDs within classes is

    $$\lambda_i = \frac{\sigma_i}{\sigma_1 + \sigma_2 + \dots + \sigma_m} \quad \text{(Eq.19)}$$

    where

    $$\lambda_1 + \lambda_2 + \dots + \lambda_m = 1 \quad \text{(Eq.20)}$$

    Through this definition of the $\lambda_i$'s we have the same unit measuring the SDs within classes and, simultaneously, the probability (proportion) by which the behavioural component of each class contributes to the overall behaviour (sum of SDs).

    3) The sum $\sigma_1 + \sigma_2 + \dots + \sigma_m$ participates with percentage $(\sigma_1 + \sigma_2 + \dots + \sigma_m)/\sigma$ in the SD of the total population. One unit of the within-classes SD participates, therefore, with percentage $1/\sigma$.

    4) The probability that the SD proportion $\lambda_1$ (of class-1) is present in the SD of the total population then equals the product $\lambda_1 \cdot 1/\sigma$.

    5) We now consider an element of class-1 from the subspace $\lambda_1 N_1$. The probability that this element is present in the total space equals the product of the probabilities of its constituent subspaces, that is: $(\lambda_1/\sigma) \cdot (N_1/N)$. In other words

    $$P_{prob1} = \frac{\lambda_1 N_1}{\sigma N} \quad \text{(Eq.21)}$$

    6) Then, the probability $p_{11}$ that an element of class-1 of the space is represented in the sample $n$ equals $p_{11} = P_{prob1} \cdot (n/N)$. Having $n_1$ elements, the respective probability becomes equal to $p_{n1} = n_1 P_{prob1}\, n/N$.

    7) But, according to the principle of representation, the above probability must correspond to the first sample class and be equal to the respective proportion $\nu_1 = n_1/n$. That is,

    $$n_1 \frac{\lambda_1 N_1}{\sigma N} \frac{n}{N} = \frac{n_1}{n} \quad \text{(Eq.22)}$$

    and, therefore,

    $$n^2 = \frac{\sigma N^2}{\lambda_1 N_1} = \frac{\sigma N^2}{\lambda_1 (w_1 N)} = \frac{\sigma N}{\lambda_1 w_1} \quad \text{(Eq.23)}$$


  • By repeating the above process for all classes $i = 1:m$, a system of $m$ equations is formed:

    $$n^2 = \frac{\sigma N}{\lambda_i w_i}, \quad i = 1:m \quad \text{(Eq.24)}$$

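    Multiplying the above equations side by side provides the required sample size. For this case of statistical representation we denote it by the symbol $n_{bath}$:

    $$n_{bath} = \sqrt{\frac{N}{\sqrt[m]{w_1 w_2 \cdots w_m}}}\; \sqrt{\frac{\sigma}{\sqrt[m]{\lambda_1 \lambda_2 \cdots \lambda_m}}} = n_{ath}\, f \quad \text{(Eq.25)}$$

    Eq.25 implies that the sample size for the case of statistical representation is the product of the population representation, expressed by $n_{ath}$, and of the statistical-behavioural factor $f$ defined by the equation

    $$f = \sqrt{\frac{\sigma}{\sqrt[m]{\lambda_1 \lambda_2 \cdots \lambda_m}}} \quad \text{(Eq.26)}$$

    where the $\lambda_i$ are the SD proportions of Eq.19.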

    The statistical factor $f$, since $\sigma$ is usually unknown, can be replaced by the respective sample factor $f_S$ (which can be calculated from some previous measurement)

    $$f_S = \sqrt{\frac{s}{\sqrt[m]{\lambda_1 \lambda_2 \cdots \lambda_m}}} \quad \text{(Eq.27)}$$

    The synthesis of the representative space (sample) $n_1, n_2, \dots, n_m$, in order to be similar here to the synthesis of the space $\sigma N$ (principle of representation), must follow the equations:

    $$n_i = \frac{N_i \sigma_i}{N_1 \sigma_1 + \dots + N_m \sigma_m}\, n_{bath}, \quad i = 1, 2, \dots, m \quad \text{(Eq.28)}$$
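The chain from Eq.19 to Eq.28 can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: it assumes the readings $f = \sqrt{\sigma / \sqrt[m]{\lambda_1 \cdots \lambda_m}}$ for Eq.26 and $n_{bath} = n_{ath} f$ for Eq.25, and the function names (`geometric_mean`, `statistical_factor`, `statistical_sample`) as well as the example inputs are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def statistical_factor(class_sds, overall_sd):
    """Statistical-behavioural factor f (Eq.26, or Eq.27 with sample SDs)."""
    total = sum(class_sds)
    lam = [sd / total for sd in class_sds]            # SD proportions, Eq.19
    return math.sqrt(overall_sd / geometric_mean(lam))

def statistical_sample(class_sizes, class_sds, overall_sd):
    """Return (n_bath, allocation) following Eq.25 and Eq.28."""
    N = sum(class_sizes)
    w = [size / N for size in class_sizes]            # Eq.2
    n_ath = math.sqrt(N / geometric_mean(w))          # Eq.17
    n_bath = n_ath * statistical_factor(class_sds, overall_sd)   # Eq.25
    # allocation proportional to N_i * sigma_i, Eq.28
    weights = [size * sd for size, sd in zip(class_sizes, class_sds)]
    total_weight = sum(weights)
    allocation = [x / total_weight * n_bath for x in weights]
    return n_bath, allocation

# Hypothetical two-class population; all numbers are illustrative only.
n_bath, allocation = statistical_sample([600, 400], [2.0, 6.0], overall_sd=1.5)
print(round(n_bath, 1), [round(a, 1) for a in allocation])
```

In this sketch the class with the larger product $N_i \sigma_i$ receives the larger share of the sample, which is the defining property of the allocation in Eq.28. When all class SDs are equal, every $\lambda_i$ equals $1/m$ and the factor reduces to $\sqrt{m\sigma}$.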

    It is well known that the sample allocation expressed by Eq.28 is the one which ensures minimum variance [6,9]. The calculation of sample size, therefore, for the case of statistical representation requires, in addition to the $N_i$'s, the knowledge of the SDs. Since the SDs of the population classes $\sigma_1, \sigma_2, \dots, \sigma_m$ are usually unknown, they can be replaced by their respective sample estimates $s_1, s_2, \dots, s_m$. If the SDs for the sample classes are not given, then a slightly smaller sample size is calculated when they are taken as equal. For this case, the behavioral factor becomes $f_S = \sqrt{m s}$ and the per-class sample size becomes $n_i = w_i n_{bath}$. What is needed here is only the knowledge of the SD for the whole population or the sample. The allocation $n_i = w_i n_{bath}$ reflects the case of simple stratified sampling, which ensures a variance smaller than that of simple random sampling but bigger than that of optimal allocation (Eq.28). It is obvious that when the statistical factor becomes unity, Eq.25 degenerates into Eq.17 and $n_{bath} = n_{ath}$.

    3.3.3 Application 2: Comparison to Deming's example

    Deming [6], in his classical book Some theory of sampling (pp.233-4), gives an application example of the method of Neyman's sampling ($n_i$ proportional to $N_i \sigma_i$).

    Table 1. Description of the Universe

    | Stratum limits in terms of total assets (1000$) | Number of corporations $N_i$ | Estimated average net income $\mu_i$ (x 1000$) | Standard deviation of net income $\sigma_i$ (x 1000$) |
    |---|---|---|---|
    | Unknown | 5600 | 1 | 5 |
    | Under 50 | 28700 | 1 | 5 |
    | 50-99 | 11100 | 5 | 8 |
    | 100-249 | 13000 | 15 | 20 |
    | 250-499 | 7500 | 50 | 65 |
    | 500-999 | 5100 | 100 | 130 |
    | 1000-4999 | 5800 | 300 | 390 |

    Deming's example is rephrased here as follows: A program is planned with the purpose of collecting financial data (such as sales, market cost of goods, income) from the American manufacturing corporations. To check the reliability (accuracy) of the data under collection, the project administration decided to set the net income of each corporation as the controlling criterion of data reliability. A sample was therefore designed with the purpose of estimating the precision of net income and, in consequence, the accuracy of the remaining financial parameters. Deming's calculations of sample size (7600) and synthesis are given in Table 2. To compare them with the method of statistical representation proposed in this article, we calculated the sample size through the following steps:

    (1) Calculation of the proportions $w_i = N_i/N$, for $i = 1, 2, \dots, m$ and $m = 7$.

    (2) Calculation of the $\lambda_i$'s, by Eq.19.

    (3) Calculation of $n_{ath}$, by Eq.17: $n_{ath} = 805.3336$.

    (4) Calculation of the $n_i$'s, by Eq.18 ($n_i = w_i n_{ath}$).

    (5) Calculation of the variance (on the basis of the above $n_i$'s) through the general equation of variance (not that of minimum variance)

    $$\sigma^2 = \sum_{i=1}^{m} w_i^2\, \frac{\sigma_i^2}{n_i}\, \frac{N_i - n_i}{N_i - 1} \quad \text{(Eq.29)}$$

    This provided SD = 4.0138.

    (6) Calculation of the statistical-behavioural factor, by Eq.26 (Eq.27): $f_S = 9.4660$.

    (7) Calculation of the product $n_{bath} = n_{ath} * f_S$, giving the total sample size $n_{bath} \approx 7623$.

    (8) Allocation of $n_{bath}$ according to Eq.28.

    The results for both methods and the percent of divergence between calculations are given in Table 2.

    Table 2. Comparison of sample size calculations

    | Stratum limits in terms of total assets (1000$) | Deming's calculation $n_i$ | Calculation through the proposed formula $n_{bath,i}$ | Percent of divergence between calculations |
    |---|---|---|---|
    | Unknown | 54 | 54 | 0 |
    | Under 50 | 277 | 278 | 0.36% |
    | 50-99 | 172 | 172 | 0 |
    | 100-249 | 502 | 504 | 0.39% |
    | 250-499 | 942 | 945 | 0.32% |
    | 500-999 | 1281 | 1285 | 0.31% |
    | 1000-4999 | 4372 | 4384 | 0.27% |
    | All classes | 7600 | 7622 | 0.29% |
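The steps of Application 2 can be retraced numerically from the data of Table 1. The sketch below is one possible Python rendering under stated assumptions: Eq.29 is read with the finite-population correction $(N_i - n_i)/(N_i - 1)$, the factor is read as $f_S = \sqrt{SD / \sqrt[m]{\lambda_1 \cdots \lambda_m}}$, and the total is rounded to a whole number before being allocated by Eq.28 (an assumption suggested by the column total of 7622 in Table 2). All variable names are hypothetical.

```python
import math

# Table 1 (Deming's universe): stratum sizes N_i and SDs of net income sigma_i
N_i = [5600, 28700, 11100, 13000, 7500, 5100, 5800]
sigma_i = [5, 5, 8, 20, 65, 130, 390]

def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

N = sum(N_i)
w = [size / N for size in N_i]                        # step 1, Eq.2
lam = [sd / sum(sigma_i) for sd in sigma_i]           # step 2, Eq.19
n_ath = math.sqrt(N / geometric_mean(w))              # step 3, Eq.17
n_prop = [w_k * n_ath for w_k in w]                   # step 4, Eq.18

# step 5, Eq.29 (assumed reading, with the (N_i - n_i)/(N_i - 1) correction)
variance = sum(
    w_k ** 2 * sd ** 2 / n_k * (size - n_k) / (size - 1)
    for w_k, sd, n_k, size in zip(w, sigma_i, n_prop, N_i)
)
sd_total = math.sqrt(variance)

f_S = math.sqrt(sd_total / geometric_mean(lam))       # step 6, Eq.26/27
n_bath = n_ath * f_S                                  # step 7, Eq.25

# step 8, Eq.28; assumption: the rounded total is what gets allocated
weights = [size * sd for size, sd in zip(N_i, sigma_i)]
allocation = [round(x / sum(weights) * round(n_bath)) for x in weights]

print(f"n_ath={n_ath:.2f}  SD={sd_total:.4f}  f_S={f_S:.4f}  n_bath={n_bath:.0f}")
print(allocation, sum(allocation))
```

Under these assumptions the printed values should be close to those quoted in the steps above (SD near 4.0138, $f_S$ near 9.4660, $n_{bath}$ near 7623), and the per-stratum allocation can be compared directly with the third column of Table 2.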

    4 Conclusion

    1. The principle of representation may reshape many types of sampling problems. In this article, by distinguishing the problem of population representation from that of statistical representation, we found results similar to those of the statistical bibliography.

    2. The problem of population representation (required minimum size) has a unique solution, which is provided by the Athenian norm of representation ($n_{ath}$). The second problem, that of statistical representation, includes the problem of population representation, and its solution incorporates automatically the condition of minimum variance permitted by the available information. If the population representation is known, the statistical representation can be achieved with good accuracy by knowing only the SD of the population or sample as a whole.

    3. The proposed formula for the population representation is straightforward, does not demand the statistical factor, is easily applicable and introduces a new point of view for the political and social sciences. This formula is considered a fundamental one, since $n_{ath}$ is the decisive multiplication factor in the formula of statistical representation $n_{bath} = n_{ath} * f$. The allocation of subjects implied by the last formula comes directly from the application of the representation principle (not from an optimization procedure), a fact which, together with the automatic satisfaction of the condition of minimum variance, also generates promising ideas for the field.

    References:

    [1] Cohen, Jacob. Statistical power analysis for the behavioural sciences, New York: Academic Press, 1969, [231, 243, 248, 252, 314].

    [2] Kirk, Roger E. Introductory Statistics, Wadsworth Publishing, 1978.

    [3] Julious, Steven A. Tutorial in Biostatistics: Sample sizes for clinical trials with Normal data, Statist. Med., 2004; 23:1921-1986.

    [4] Conover, W. J. Practical Nonparametric Statistics, 2nd ed., John Wiley & Sons, 1980.

    [5] Neyman, Jerzy. On the two different aspects of representative method: The method of stratified sampling and the method of purposive selection, Journal of Royal Statistical Society, Vol.97, No.4 (1934), pp.558-625.

    [6] Deming, W. E. Some theory of sampling, New York: Dover Publications, 1966, pp.226-230 (originally published by John Wiley in 1950).

    [7] Sinclair, R. K. Democracy and participation in Athens, Cambridge University Press, 1988.

    [8] Glotz, Gustave. Ancient Greece at Work (tr. by M. R. Dobie and E. M. Riley), New York: Barnes & Noble, 1968.

    [9] Tryfos, Peter. Sampling methods for applied research, New York: John Wiley, 1996, p.98.
