DIPARTIMENTO DI SCIENZE ECONOMICHE AZIENDALI E STATISTICHE
Via Conservatorio 7 20122 Milano
tel. ++39 02 503 21501 (21522) - fax ++39 02 503 21450 (21505) http://www.economia.unimi.it
E Mail: [email protected]
GENERATING ORDINAL DATA
PIER ALDA FERRARI ALESSANDRO BARBIERO
Working Paper n. 2011-38
DECEMBER 2011
Generating ordinal data
Pier Alda Ferrari, Alessandro Barbiero
Department of Economics, Business and Statistics, Università di Milano, via Conservatorio 7, 20122 Milan (Italy)
Abstract
Due to the increasing use of ordinal variables in many fields, new statistical methods for their analysis have been
introduced, whose performance needs to be investigated under different experimental conditions. Proper procedures
for simulating ordinal variables are therefore required. The present paper deals with simulation from multivariate
ordinal random variables. A new proposal for generating samples from ordinal random variables with pre-specified
correlation matrix and marginal distributions is presented. Its features are examined and a comparison with its main
competitors is discussed. An implementation as an R package is provided. Examples of application are
also supplied.
1. Introduction
In recent years, increasing interest has been devoted to categorical ordinal data and to methods for their
statistical analysis. Most of these methods concern the reduction of the dimensionality of the original dataset, given
the huge number of units and/or variables usually involved (large datasets) and the degree of association present in
the variables. Consider questionnaires on service quality or risk propensity, or psychometric assessments,
where respondents answer many questions about different aspects of their satisfaction with, or attitude towards,
a service, product, or situation they have experienced. Typically, the possible answers are ordered, generally according to the
level of agreement with a statement, and are highly associated; a dataset of associated ordinal variables thus arises.
Statistical techniques for such data keep being developed, but their performance and properties need to be established. In exploratory analysis,
the robustness and performance of a technique can be assessed almost exclusively through simulation studies,
which require the generation of a huge number of datasets; and even in small-sample studies, the properties of
estimators must be studied by simulation when parameter estimation is based on asymptotic theory.
Different proposals have been presented for the construction of such datasets. These techniques
are based on transforming original variables with known distributions into final variables with the desired
features. They follow two approaches: either sample from original variables adjusted “a priori” in order to match
the requirements for the final variables, or adjust “a posteriori” the drawn sample to reach the desired features. Most
of them, however, address continuous random variables.
BePress November 14, 2011
Within the first approach, and starting from a multivariate standard normal distribution Z with a
specified correlation matrix R_Z, Lurie and Goldberg [15] presented an algorithm for generating a vector X of random
variables with given marginal distributions and a correlation matrix as “close” as possible to a target matrix R_X. This
proposal can only be applied to a set of final continuous variables with strictly increasing distribution functions.
Cario et al. [3] proposed NORTA (NORmal To Anything), a more general method which transforms, through
an iterative procedure, a multivariate standard normal random vector with correlation matrix R_Z into the desired
random vector with correlation matrix R_X and arbitrary marginal distributions. Stanhope [19] extended the core
idea of NORTA by proposing alternatives for the cases where NORTA fails, moving from the NORTA to the
MTA (Multivariate To Anything) procedure.
Charmpis and Panteli [5] proposed an interesting different two-step approach to solve the problem: first a
univariate random sample is independently drawn from each pre-specified marginal distribution, then a heuristic
optimization procedure sequentially rearranges the univariate samples, from the second to the last, in order to
match the desired correlation coefficient. Ruscio and Kaczetow [17] proposed a similar technique for simulating
multivariate non-normal samples using an iterative algorithm. This proposal, referred to as sample and iterate (SI)
technique, is able to generate a dataset that reproduces the distributions and correlations directly specified by the
user or observed in a sample of data provided. This technique can be applied for simulating both continuous and
discrete variables, but is also quite computationally intensive. Contrary to the first procedures, which transform
“a priori” the random variables and then sample from the multidimensional random variable so obtained, the latter
two algorithms transform the samples obtained, trying to match the target marginal distributions and correlations
in the sample.
With regard to the more specific case of final ordinal variables, not much has been proposed. Biswas [1]
suggested a method for generating ordinal categorical data with a particular correlation structure, but it requires
identically distributed ordinal random variables, so it cannot be applied when the marginal distributions differ, and
hence not even when the number of categories varies across variables. This is an obvious drawback, which
limits its generality and applicability. An interesting method was introduced by Demirtas [6] for generating
multivariate ordinal data with given marginal distributions and correlation matrix. His technique is based on
simulating binary data whose marginals are derived by collapsing the pre-specified marginals of the ordinal variables.
The binary correlation matrix is obtained by an iterative process, in order to match the target correlation matrix for the
ordinal data. Binary data are then converted to ordinal data through a randomization step. This method is very flexible:
it allows the creation of ordinal data matrices with a different number of categories and a different distribution for
each variable, and a specific correlation pattern. It presents some drawbacks, however: the procedure is involved and
computationally expensive; for example, in order to achieve the required correlations it introduces an iterative
procedure requiring the generation of large samples of binary data, which slows down the process.
This paper introduces a different and simple procedure to obtain multivariate ordinal variables with assigned
marginal distributions and correlation structure. Its starting point is the simulation of a sample from a multivariate
normal random variable, which is well documented in the literature (see for example [13]) and implemented in almost
all statistical software packages; the values of the sample are then “discretized” in order to obtain an ordinal dataset. The matrix
R_Z is adjusted to ensure the desired characteristics for the ordinal variables. The method allows the user to set “a
priori” the number of categories and the distribution of each final ordinal variable, as well as the correlation matrix
of the final multivariate ordinal variable, so that it is possible to assess the performance of a method under varying
experimental conditions. The procedure is implemented in R.
The paper is outlined as follows. In Section 2, the main features of the correlation coefficient for ordinal
variables and the relationship between the correlation coefficient of two continuous variables and that of the
corresponding discretized variables are described. In Section 3 a proposal for generating ordinal datasets according
to specified experimental conditions is presented, while its features and a comparison with the main competitors are
discussed in Section 4. In Section 5, the validity and applicability of the procedure are checked through two simulation
studies, addressed to analyzing sampling distributions and to comparing two alternative methods of dimensionality
reduction: PCA vs NLPCA. Finally, some conclusions are presented in Section 6.
2. Correlation for ordinal variables
Almost all the methods described above employ Pearson's ρ as a measure of association: this is due to
the far greater popularity enjoyed by this index compared to others available in the literature, and to the lack of
computational procedures for generating random vectors with an association structure specified in terms other than a
correlation matrix; a significant exception is given in [11]. The adoption of Pearson's ρ requires a numerical, or
at least a Likert, scale (1, 2, . . . , k) for all the variables. Henceforth, as we are dealing with ordinal variables, we will
assume a Likert scale and treat it as a numerical scale. This assumption is more general than it seems: since the
correlation coefficient is invariant under location and scale transformations, the value of ρ depends on
the joint distribution of the variables but does not change whatever values are assigned to the ordered categories,
provided that they are equidistant. In any case, what we propose works even with different, non-equidistant
numerical scores for the categories of the ordinal scale.
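As a quick illustration of this invariance (a Python sketch with simulated data; the variable names and the toy association mechanism are ours), Pearson's ρ is unchanged by any affine recoding of equidistant scores but generally changes under a non-equidistant recoding:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 5, size=1000)                      # ordinal codes 1..4
y = np.clip(x + rng.integers(-1, 2, size=1000), 1, 4)  # an associated ordinal variable

rho = np.corrcoef(x, y)[0, 1]
rho_affine = np.corrcoef(10 * x, 5 * y - 3)[0, 1]      # affine recoding: rho unchanged
scores = np.array([1, 2, 4, 8])                        # non-equidistant category scores
rho_uneq = np.corrcoef(scores[x - 1], scores[y - 1])[0, 1]  # generally differs
```

Here `rho` and `rho_affine` coincide to machine precision, while `rho_uneq` departs from them.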
Before presenting our proposal, two further aspects concerning the correlation coefficient ρ for ordinal variables
need to be discussed. The first regards the range of values that the correlation coefficient can take for discrete
variables; the second concerns the relationship between the correlation of two continuous variables and the
correlation of the two corresponding discretized ones.
As far as the former aspect is concerned, it is worth noting that when the study involves two discrete
variables (X_1, X_2) with assigned marginal distributions, it is not always possible to assign any value in the interval
[−1, +1] to the correlation coefficient.
In fact, if F_1(x_1) and F_2(x_2) are the (right-continuous) cumulative distribution functions (cdf) of X_1 and X_2 and
F(x_1, x_2) their joint cdf, it is known that:

(a) F_m(x_1, x_2) ≤ F(x_1, x_2) ≤ F_M(x_1, x_2), where F_m(x_1, x_2) = max{0, F_1(x_1) + F_2(x_2) − 1} and F_M(x_1, x_2) = min{F_1(x_1), F_2(x_2)}
(Fréchet, 1951); and

(b) the correlation coefficient between X_1 and X_2 is minimized at F(x_1, x_2) = F_m(x_1, x_2) and maximized at
F(x_1, x_2) = F_M(x_1, x_2) (Schweizer and Wolff, 1981), where F_m and F_M are the Fréchet minimal and maximal
distributions defined in (a).

For the specific case of ordinal variables, results (a) and (b) above keep holding and allow us to
calculate the minimum (ρ_m) and maximum (ρ_M) values of the correlation coefficient. In fact, given the two marginal
distributions F_1(x_1) and F_2(x_2), it is sufficient to determine the cdfs F_m(x_1, x_2) and F_M(x_1, x_2), adopt a Likert scale
for both X_1 and X_2, and then calculate ρ_m and ρ_M.
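The computation of (ρ_m, ρ_M) just described can be sketched as follows (a Python/numpy sketch; the function name and interface are ours): build the two Fréchet joint cdfs from the marginal cdfs, recover the joint pmfs by double differencing, and compute the two correlations on the Likert scores.

```python
import numpy as np

def frechet_rho_bounds(p1, p2):
    """Minimum and maximum Pearson correlation attainable by two ordinal
    variables with marginal pmfs p1, p2 on Likert scales 1..k1 and 1..k2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    F1, F2 = np.cumsum(p1), np.cumsum(p2)
    x1, x2 = np.arange(1, p1.size + 1), np.arange(1, p2.size + 1)

    def rho(F):
        # joint pmf from joint cdf by double finite differencing
        P = np.diff(np.diff(F, axis=0, prepend=0), axis=1, prepend=0)
        m1, m2 = x1 @ p1, x2 @ p2
        s1 = np.sqrt(((x1 - m1) ** 2) @ p1)
        s2 = np.sqrt(((x2 - m2) ** 2) @ p2)
        return (np.outer(x1 - m1, x2 - m2) * P).sum() / (s1 * s2)

    F_M = np.minimum.outer(F1, F2)                  # Frechet upper bound (a)
    F_m = np.maximum(np.add.outer(F1, F2) - 1, 0)   # Frechet lower bound (a)
    return rho(F_m), rho(F_M)
```

For two uniform binary marginals this returns (−1, 1); with unequal marginals the attainable range shrinks, consistent with the bounds check on the user-supplied target matrix described in Section 4.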
Coming to the second issue, the consequences of discretizing two continuous variables on
the value of their correlation coefficient were analysed by Guilford (1965) in connection with the discretization of a
standard bivariate normal distribution. He shows that if Z_1 and Z_2 are two variables with joint normal distribution
and correlation coefficient ρ_C, and X_1, X_2 are the two dichotomous variables obtained by collapsing Z_1 and Z_2 according
to whether each is smaller or greater than its mean, the correlation coefficient ρ_O between X_1 and X_2 is
ρ_O = (2/π) sin^{−1}(ρ_C). It is smaller than ρ_C and can be approximated by (2/3)ρ_C for ρ_C smaller than 0.6.
Apart from some further special cases (see for example [14]), it is not possible to write the relationship between
ρ_O and ρ_C in explicit terms, so Bollen and Barb [2] analyzed through a simulation study how the discretization
process influences the correlation for a bivariate normal distribution. They collapsed the values of each variable
into k ≥ 2 categories, taking intervals of the same width, and obtained smaller correlations: the lower the number
of categories, the less close the correlations of the collapsed variables to those of the original ones. Cario et al. [3]
noticed that when transforming a standardized multidimensional normal variable Z into a generic variable X, the
correlation matrices R_Z and R_X differ only slightly if the marginals X_i are continuous and relatively symmetrical,
while they differ significantly if the X_i are discrete (and not symmetrical): they considered the case of a trivariate
random variable with Binomial marginals (n = 3, p = 0.5).
Since in our proposal the relationship between ρ_C and ρ_O is a crucial point, we discuss this issue further,
determining the values of the ordinal correlation coefficient ρ_O between two uniformly distributed final ordinal
variables X_1 and X_2, obtained by discretizing two standard normal variables Z_1 and Z_2, for different values of their
correlation coefficient ρ_C and of the number k of ordinal categories (for simplicity, k is fixed equal for the two variables).
The correlation coefficient of the ordinal variables is determined by the following relationship:

ρ_O = [ Σ_{l=1}^{k} Σ_{t=1}^{k} (l − E(X_1)) (t − E(X_2)) f_{Z_1 Z_2}(l/k, t/k) ] / sqrt( Σ_{l=1}^{k} (l − E(X_1))^2 (1/k) · Σ_{t=1}^{k} (t − E(X_2))^2 (1/k) )    (1)
[Figure 1: Comparison between ρ_O and ρ_C (discretization with uniform marginals). Panel (a): values of ρ_O for the discretized variables as ρ_C varies, for k = 2, 5, 8; panel (b): values of the ratio ρ_O/ρ_C as k varies, for ρ_C = 0.8, 0.5, 0.2.]
where

f_{Z_1,Z_2}(l/k, t/k) = Φ_{Z_1,Z_2}(Φ_Z^{−1}(l/k), Φ_Z^{−1}(t/k)) − Φ_{Z_1,Z_2}(Φ_Z^{−1}(l/k), Φ_Z^{−1}((t−1)/k))
− Φ_{Z_1,Z_2}(Φ_Z^{−1}((l−1)/k), Φ_Z^{−1}(t/k)) + Φ_{Z_1,Z_2}(Φ_Z^{−1}((l−1)/k), Φ_Z^{−1}((t−1)/k)),

Φ_{Z_1,Z_2} being the bivariate standard normal distribution function, Φ_Z the univariate standard normal distribution
function, and Φ_Z^{−1} its inverse. ρ_O can be calculated with the R package mvtnorm, which computes the values
of the bivariate normal distribution function.
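Equation (1) can be evaluated numerically. The sketch below (Python, with scipy playing the role that the mvtnorm package plays in R; the function name and the ±10 stand-ins for infinite limits are ours) computes ρ_O for uniform marginals; for k = 2 it reproduces Guilford's result ρ_O = (2/π) sin^{−1}(ρ_C).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def rho_ordinal(rho_c, k):
    """rho_O of Eq. (1): correlation of two uniform ordinal variables obtained
    by discretizing a standard bivariate normal with correlation rho_c."""
    cuts = norm.ppf(np.arange(k + 1) / k)
    cuts[0], cuts[-1] = -10.0, 10.0          # numerical stand-ins for -inf, +inf
    mvn = multivariate_normal([0.0, 0.0], [[1.0, rho_c], [rho_c, 1.0]])
    # joint cdf on the grid of thresholds, then cell probabilities f(l/k, t/k)
    C = np.array([[mvn.cdf([a, b]) for b in cuts] for a in cuts])
    P = np.diff(np.diff(C, axis=0), axis=1)
    x = np.arange(1, k + 1)
    m1, m2 = x @ P.sum(axis=1), x @ P.sum(axis=0)
    s1 = np.sqrt(((x - m1) ** 2) @ P.sum(axis=1))
    s2 = np.sqrt(((x - m2) ** 2) @ P.sum(axis=0))
    return (np.outer(x - m1, x - m2) * P).sum() / (s1 * s2)
```

With ρ_C = 0.8 and k = 2 this gives a value close to .590, matching Table 1a; as k grows the value approaches ρ_C from below.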
ρ_C \ k    2      4      6      8      10
 .80     .590   .729   .759   .770   .776
 .60     .410   .531   .557   .567   .572
 .40     .262   .347   .366   .374   .377
 .20     .128   .172   .182   .186   .188
(a) Values of ρ_O as k varies, for different ρ_C

ρ_C \ k    2      4      6      8      10
 .80     .738   .911   .948   .963   .969
 .60     .683   .885   .929   .946   .954
 .40     .655   .868   .916   .934   .943
 .20     .641   .859   .909   .928   .938
(b) Values of the ratio ρ_O/ρ_C as k varies, for different ρ_C

Table 1: Comparison between ρ_C and ρ_O for different values of ρ_C and k (discretization with uniform marginals).
A synthesis of the results is reported in Table 1: panel (a) gives, for each value of ρ_C, the corresponding ρ_O as k
varies, while panel (b) reports the related ratios ρ_O/ρ_C. We note that ρ_C and ρ_O are close when ρ_C is high
and/or the number of categories k is large. To better detect the role of the marginal distributions, the values of ρ_O
[Figure 2: Comparison between ρ_O and ρ_C (discretization with equal-width intervals). Panel (a): values of ρ_O for the discretized variables for different ρ_C and k = 2, 5, 8; panel (b): values of the ratio ρ_O/ρ_C for different k and ρ_C = 0.8, 0.5, 0.2.]
and of the ratio ρ_O/ρ_C are also obtained when the discretization is performed by dividing the interval (−3, 3) of Z into
intervals of the same width (so that a symmetrical but non-uniform distribution is obtained). These results (see Tables
2a and 2b) are only slightly different from the previous ones.
ρ_C \ k    2      4      6      8      10
 .80     .590   .673   .736   .763   .776
 .60     .410   .502   .552   .572   .582
 .40     .262   .334   .368   .381   .388
 .20     .128   .167   .184   .191   .194
(a) Values of ρ_O for different ρ_C and k

ρ_C \ k    2      4      6      8      10
 .80     .738   .841   .920   .954   .970
 .60     .683   .837   .920   .953   .969
 .40     .655   .836   .920   .953   .969
 .20     .641   .836   .920   .953   .969
(b) Values of ρ_O/ρ_C for different ρ_C and k

Table 2: Comparison between ρ_C and ρ_O for different ρ_C and k (discretization with intervals of the same width).
To make the comparison between ρ_C and ρ_O as k varies more evident, the results are displayed in Figures 1
and 2, which refer to the two discretization procedures: equal probabilities and equal-width intervals, respectively.
From the analysis of Tables 1 and 2 and Figures 1 and 2, we can state that the correlation coefficient of two
continuous normal variables is always larger than that of the corresponding discretized variables, but the two get
closer as the number of categories of the discrete ordinal variables and/or the value of ρ_C itself increases. This
point needs to be taken into account when generating discrete data from continuous variables or when representing
continuous variables by their discretized version. Incidentally, the method we are going to introduce allows us to
know the magnitude of the change in ρ in passing from continuous to ordinal variables and vice versa, under
different experimental conditions, and consequently allows us to take it into account in carrying out the analysis.
3. The new proposal
The method we propose is based on the transformation and adjustment of the original variables to produce
ordinal variables with the required characteristics. It is developed in two steps. The first step sets up the original
continuous variables so as to achieve ordinal variables meeting the experimental conditions; the second
generates samples for simulation from the original variables so adjusted.
3.1. Setting up of the continuous variables
First of all, once the experimental conditions are fixed, we need a procedure that guarantees them. The procedure
we develop is based on the following scheme.
We take a r.v. Z ∼ N(0, R_C), with R_C known (in the first stage R_C = R_{O*}, where R_{O*} is the target correlation
matrix among the ordinal variables), and transform it by employing a quantile approach; we obtain an ordinal r.v. X
with the assigned marginal distributions. We compute R_O on X, compare it with R_{O*}, and adjust R_C through an
“ad hoc” iterative procedure until R_O converges to R_{O*}.
In more detail, we consider an m-dimensional variable Z following a standard normal distribution with correlation
matrix R_C, i.e. Z ∼ N(0, R_C = R_{O*}) at the first step. We transform the original variable Z into a variable
X with categorical (ordinal) components, proceeding in the following way (see also [8]). For the i-th component,
the k_i − 1 probabilities 0 < F_{i1} < F_{i2} < · · · < F_{il} < · · · < F_{i(k_i−1)} < 1 of the marginal distribution of X_i define the
corresponding normal quantiles r_{i1} < r_{i2} < · · · < r_{il} < · · · < r_{i,k_i−1}. The values of the variable Z_i are then converted into
ordinal categories, labeled with the integers X_i, in the following way:

if Z_i < r_{i1} → X_i = 1
if r_{i1} ≤ Z_i < r_{i2} → X_i = 2
. . .
if r_{i(k_i−1)} ≤ Z_i → X_i = k_i.    (2)

An m-dimensional ordinal variable X = (X_1, X_2, . . . , X_m) is thus settled. The single components X_i of X can have
a different number of categories and different marginal probabilities, according to the number k_i and the values of F_{il}
chosen.
This procedure ensures the desired marginal distribution F_i for each component X_i, but the correlation matrix
R_O of the ordinal vector X may differ considerably from the chosen matrix R_C = R_{O*}, because the discretization
process alters the correlation coefficients (see Section 2). To overcome this problem, it is necessary to
determine a continuous correlation matrix R_{C*} able to ensure, for the transformed m-dimensional ordinal variable,
the target ordinal correlation matrix R_{O*}. This is sorted out by the following algorithm:
1. Set the correlation matrix for the continuous variables R_C(0) = R_{O*} and discretize Z according to the
procedure described above, thus obtaining an ordinal random variable X(1). This step can be performed in R
through the function pmvnorm (package mvtnorm), which computes the distribution function of the multivariate
normal distribution for arbitrary limits and correlation matrix;
2. Compute R_O(1) for X(1);
3. Loop: while max_{i≠j} |ρ_{ij}^{O(k)} − ρ_{ij}^{O*}| > ε and k ≤ k_max (k_max and ε > 0 are discussed below):
   (a) update each element of the correlation matrix of the multivariate normal distribution, imposing

       ρ_{ij}^{C(k)} = ρ_{ij}^{C(k−1)} f_{ij}(k),   i = 1, 2, . . . , m, j > i,    (3)

       where f_{ij}(k) = ρ_{ij}^{O*} / ρ_{ij}^{O(k)} for all i ≠ j. The ratio f_{ij}(k) can thus be regarded as a “correction
       coefficient” at step k for moving from the continuous to the ordinal variable with the fixed marginal
       distributions and the discretization procedure described above; the update expressed by (3) is adapted
       when peculiar situations require it;
   (b) if R_C(k) is no longer a correlation matrix, find the “nearest positive definite matrix” (see [12] for the
       implementation in R);
   (c) generate the ordinal random variable X(k+1) according to the discretization algorithm described above,
       using R_C(k);
   (d) compute R_O(k+1);
4. The final continuous correlation matrix R_{C*} is used to generate an n × m sample matrix of ordinal data coming
from variables with the assigned marginal distributions and ordinal correlation matrix R_{O*}.

ε is the maximum admissible error on the off-diagonal elements of the actual correlation matrix of the ordinal r.v.
(a plausible value is 0.001). To ensure termination of the algorithm if ε is set too small, a maximum number
of iterations k_max can be fixed (e.g. k_max = 100).
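As an illustration, the loop above can be sketched in Python for the bivariate case (our own Monte Carlo variant: the exact recomputation of R_O via pmvnorm is replaced by the sample correlation of a large generated sample; all names and defaults are ours):

```python
import numpy as np
from statistics import NormalDist

def ordinal_sample(rho_target, probs, n=200_000, eps=0.005, kmax=50, seed=1):
    """Bivariate sketch of the Section 3 procedure: iterate the correction (3)
    on rho_C until the ordinal correlation matches rho_target within eps."""
    inv = NormalDist().inv_cdf
    # thresholds r_il: normal quantiles of the cumulative marginal probabilities
    cuts = [np.array([inv(F) for F in np.cumsum(p)[:-1]]) for p in probs]
    rng = np.random.default_rng(seed)
    rho_c = rho_target                        # start from R_C(0) = R_O*
    for _ in range(kmax):
        z = rng.multivariate_normal([0.0, 0.0],
                                    [[1.0, rho_c], [rho_c, 1.0]], size=n)
        # rule (2): X_i = 1 + number of thresholds not exceeding Z_i
        x = np.column_stack([np.searchsorted(c, z[:, i]) + 1
                             for i, c in enumerate(cuts)])
        rho_o = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
        if abs(rho_o - rho_target) <= eps:    # convergence criterion
            break
        # correction coefficient f(k) = rho_O* / rho_O(k), Eq. (3)
        rho_c = float(np.clip(rho_c * rho_target / rho_o, -0.999, 0.999))
    return x, rho_c
```

In our runs with uniform marginals, the loop stops after a handful of iterations, with the final ρ_C above the target ρ_O, reflecting the dilation effect discussed in Section 2.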
3.2. Generation of samples
For every experimental condition, i.e. for every set of marginal distributions and correlation matrix R_{O*}, the
related R_{C*} is determined; the dataset is then obtained by drawing from Z ∼ N(0, R_{C*}) a random sample of size
n and transforming the values according to the discretization process described in Subsection 3.1. For simulation
studies this drawing is iterated the desired number of times under the same experimental conditions. At the end of
the process, the experimental conditions are changed according to the experimental design.
This method is implemented in R and the code is available from the authors upon request.
4. Features and comparison with main competitors
The algorithm described in the previous section assures that the sample is drawn from an ordinal multidimensional
distribution satisfying the required experimental conditions (marginal distributions and correlation matrix).
In particular, if ρ_{ij}^{O*} is the (i, j)-th element of the matrix R_{O*} and ρ_{ij}^{C*} the corresponding element of the
matrix R_{C*}, the procedure is able to determine the m(m−1)/2 values d_{ij} (dilation coefficients) which allow us to
move from ρ_{ij}^{O*} to ρ_{ij}^{C*} for all i ≠ j. More generally, it provides

ρ_{ij}^{C*} = d_{ij}(ρ_{ij}^{O}),   ρ_{ij,m}^{O} < ρ_{ij}^{O} < ρ_{ij,M}^{O},

with d_{ij} depending on the marginal distributions F_i and F_j of the components X_i and X_j. This means that, once the
marginals are assigned, the proposed procedure is able to find numerically the value of ρ_C leading to ρ_O, for each
value of ρ_O. In other words, it allows ρ_C to be reconstructed as a function of ρ_O.
To give an idea of the relationship between ρ_C and ρ_O, Table 3a reports the values of ρ_C corresponding to some
fixed values of ρ_O (on the rows) and to different numbers of categories of ordinal variables with uniform distributions.
The table is displayed similarly to Table 1a, but the ρ_C are now determined by the algorithm described above. In
Table 3b, as in Table 1b but with the roles of ρ_C and ρ_O inverted, the “dilation coefficients” ρ_C/ρ_O are reported.
It is worth noting that if the values of ρ_C in Table 3a are used as correlation coefficients of a two-dimensional normal
variable, then the discretized uniform ordinal variables provided by (2) yield exactly the values of ρ_O that produced
those ρ_C, as can easily be verified.
The procedure has been illustrated with regard to a Likert scale, but it can be used, and maintains its characteristics,
also in the case of different, non-equidistant values for the categories.
The algorithm presents some specific features and advantages compared with the similar procedures for ordinal
variables described in Section 1. With respect to them, our proposal enjoys some valuable characteristics:

• it is an easy and computationally efficient tool for obtaining a multivariate discrete variable (in particular, an ordinal
one): the only restriction, due to the discretization procedure, is that each marginal distribution must have finite
support. Most available procedures do not give explicit practical details for discrete cases (Cario et al., Charmpis and
Panteli) or confine themselves to a limited context (Biswas);

• it is able to state the exact correspondence between R_O and R_C (by an iterative procedure) and between R_C
and R_O (by a direct calculation). In this way the differences between the correlation matrices yielded by
different samples of ordinal data drawn under the same experimental conditions are due only to sampling
errors. On the contrary, Demirtas' procedure requires the generation of an intermediate sample to state
the ordinal correlations, introducing an additional sampling error, while, on the opposite side, Ruscio and
Kaczetow, by adjusting the sample in order to reach the target correlation matrix, reduce the
ρ_O \ k    2      3      4      5      6      8      10
 .80     .951   .901   .868   .850   .839   .829   .823
 .60     .809   .710   .672   .654   .643   .633   .628
 .40     .588   .490   .459   .444   .436   .427   .424
 .20     .309   .250   .233   .224   .220   .216   .213
(a) Values of ρ_C

ρ_O \ k    2      3      4      5      6      8      10
 .80    1.189  1.127  1.084  1.062  1.049  1.036  1.029
 .60    1.348  1.184  1.120  1.089  1.072  1.055  1.046
 .40    1.469  1.226  1.146  1.110  1.089  1.069  1.059
 .20    1.545  1.252  1.163  1.122  1.100  1.077  1.067
(b) Values of ρ_C/ρ_O

Table 3: Values of ρ_C and of the “dilation coefficient” ρ_C/ρ_O computed by the algorithm for different values of ρ_O and k.
sampling error or need to yield a large simulated population with the desired features from which to draw the
sample. NORTA presents instead some practical limits: a desired pairwise correlation ρ_{ij}^{O} between the i-th
and j-th variables might not be feasible, i.e. it may not be possible to find a ρ_{ij}^{C} producing the desired
ρ_{ij}^{O}; or the desired correlation values ρ_{ij}^{O} may be feasible, but the matrix of the ρ_{ij}^{C} producing the
corresponding ρ_{ij}^{O} might not be positive definite and, hence, not a correlation matrix;

• it samples from a continuous random variable and then discretizes it, using an intuitive and straightforward
method which employs the quantiles of the normal distribution; on the contrary, for example, Demirtas'
algorithm, which samples from binary variables, introduces an arbitrary choice with regard to the generating
algorithm and involves an arbitrary splitting procedure for collapsing the ordinal categories into binary ones;

• it cannot get stuck, since 1) the correlation matrix for ordinal variables entered by the user is checked (positive
definiteness and lower/upper bounds, see below), 2) it includes the nearest-positive-definite algorithm,
which adjusts the updated correlation matrix, and 3) it provides for a maximum number of iterations, after
which a correlation matrix R_{C*} is in any case obtained. This does not always hold for the other procedures
considered;
• the computation time is short: we observed that, for a 4-dimensional variable X whose components have
different numbers of categories (2, 3, 4 and 5), setting up the correlation matrix R_{C*}, which includes the
check for the feasibility of R_{O*}, takes on average (i.e. over different R_{O*} and marginal distributions) about 1
second; drawing a sample from X takes about 0.01 seconds, for a sample size n varying from 100 to 1000.
Last but not least, the procedure is easy to understand and use, even for non-experts. The R code developed allows
the user to choose the main “parameters” of the ordinal data matrix: n, m, k_i and F_i for i = 1, . . . , m; the association
level among the m ordinal variables is directly provided by the user through the target correlation matrix R_{O*}.
Moreover, the coherence of the user's requirements is checked. Particular attention has been devoted to the chosen
correlation matrix: the values ρ_{ij}^{O} are verified to be coherent with the marginal distributions of the i-th and j-th
variables, and the entered matrix is verified to be positive definite. If needed, the algorithm helps
the users by providing lower and upper bounds for each ρ_{ij}^{O} given the marginal distributions, and warns them if the
entered matrix R_O is not a correlation matrix. For these reasons the procedure seems helpful for both research and
teaching purposes.
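The positive-definiteness repair invoked in step 3(b) can be approximated very simply; the sketch below is a crude eigenvalue-clipping stand-in for the nearest-correlation-matrix algorithm cited as [12], not the authors' implementation.

```python
import numpy as np

def near_corr(A, eps=1e-8):
    """Clip negative eigenvalues and rescale to unit diagonal: a simple
    substitute for the 'nearest positive definite matrix' adjustment."""
    B = (A + A.T) / 2                       # symmetrize
    w, V = np.linalg.eigh(B)
    B = V @ np.diag(np.clip(w, eps, None)) @ V.T
    d = np.sqrt(np.diag(B))
    return B / np.outer(d, d)               # restore unit diagonal
```

An invalid "correlation" matrix such as [[1, .9, −.9], [.9, 1, .9], [−.9, .9, 1]], which has a negative eigenvalue, is mapped to a genuine correlation matrix close to it.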
5. Examples of application
In this Section we propose two examples of possible applications of the procedure. The aim of these applications
is not an in-depth analysis of the problems discussed, but just to show how our procedure can easily and quickly
lead to results. The first refers to the construction of the sampling distribution of the estimator of the correlation
coefficient for ordinal data; the second concerns the comparison of two different methods, PCA and NLPCA, for
the setting up of a composite indicator based on observed ordinal data.
5.1. Correlation coefficient estimation
In inferential problems, it is usually required to determine the sampling distribution of the sample correlation
coefficient R, for testing hypotheses or constructing confidence intervals for ρ. If the observations come from a
bivariate normal distribution, an approximate distribution of the sample correlation coefficient can be adopted.
Considering the Fisher transformation

W = arctanh(R) = (1/2) ln((1 + R)/(1 − R)),

see [7], and setting ω = (1/2) ln((1 + ρ)/(1 − ρ)), it can be proved that the random variable (W − ω)/sqrt(1/(n − 3))
converges to a standard normal distribution. It is then straightforward, recalling the formulas above, to build an
approximate CI for ω at level (1 − α), given by

(w_L, w_U) = ( w − z_{α/2} sqrt(1/(n − 3)), w + z_{α/2} sqrt(1/(n − 3)) )    (4)

and, by applying the monotonic inverse transformation, a CI for ρ, given by

(r_L, r_U) = (tanh(w_L), tanh(w_U)) = ( (e^{w_L} − e^{−w_L})/(e^{w_L} + e^{−w_L}), (e^{w_U} − e^{−w_U})/(e^{w_U} + e^{−w_U}) )
= ( (e^{2w_L} − 1)/(e^{2w_L} + 1), (e^{2w_U} − 1)/(e^{2w_U} + 1) ).    (5)
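In code, Eqs. (4) and (5) amount to a few lines (a Python sketch; the function name is ours, and z_{α/2} is taken from the standard normal):

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Approximate (1 - alpha) CI for rho via the Fisher transformation:
    Eq. (4) on the w scale, mapped back to the r scale by tanh, Eq. (5)."""
    w = atanh(r)                                   # w = arctanh(r)
    half = NormalDist().inv_cdf(1 - alpha / 2) * sqrt(1 / (n - 3))
    return tanh(w - half), tanh(w + half)
```

For example, fisher_ci(0.5, 500) returns roughly (0.43, 0.56): the interval is symmetric on the w scale but not on the r scale, because tanh is nonlinear.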
(a) Uniform
k   p
2   0.5   0.5
3   1/3   1/3   1/3
4   0.25  0.25  0.25  0.25
5   0.2   0.2   0.2   0.2   0.2
6   1/6   1/6   1/6   1/6   1/6   1/6

(b) Asymmetrical
k   p
2   0.4   0.6
3   0.1   0.3   0.6
4   0.1   0.1   0.2   0.6
5   0.1   0.1   0.1   0.1   0.6
6   0.1   0.1   0.1   0.1   0.1   0.5

Table 4: Number k of categories and probabilities p assumed for the marginal distributions.
Of course the problem is more complex when ρ concerns non-normal variables, especially discrete/ordinal
variables. For inferential problems, if we refer to the CI (5), there is no evidence that it remains valid also in
these cases. Our procedure offers a way of finding the sampling distribution of R for discrete/ordinal variables under
different experimental conditions through Monte Carlo simulations. In fact, as seen in the previous sections, the
discretization of a bivariate normal variable affects the resulting ordinal correlation coefficient ρ_O through the
number of categories, the marginal distributions and the original value of ρ_C. It is interesting to investigate
empirically how these factors can affect the sampling distribution of R, focusing in particular on the actual coverage
of the CI given by (5).
For this purpose, we consider a pair of ordinal variables, whose marginal distributions are chosen from those reported in Table 4, where k is the number of categories, and three pairwise correlation coefficients, ρ ∈ {0.2, 0.4, 0.6}, coherent with the marginal distributions. By combining the possible marginal distributions and the values of ρ, we obtain many scenarios. Under each of them, following our procedure, we generate an ordinal matrix of dimension n = 500, m = 2, compute the sample correlation coefficient r, and then construct the 95% CI for ρ by (5). In addition, for each ρ, we also draw from the bivariate normal distribution a sample of the same size n = 500 and determine by (5) an analogous CI for ρ. We iterate these steps 10,000 times; at the end of the simulation plan, we compute the Monte Carlo distributions of the sample correlation coefficient R, the coverages of these CIs and their average widths. Note that the average length of the intervals depends on ρ and n (the latter fixed here), but not on the marginal distributions, as Equations (4) and (5) show, and is the same for ordinal and continuous variables.
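The bivariate-normal arm of this plan can be sketched as follows (a Python sketch of ours, with far fewer replications than the paper's 10,000; the seed and B = 400 are arbitrary choices):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
rho, n, B = 0.6, 500, 400                     # one scenario; fewer replications than the paper
z = NormalDist().inv_cdf(0.975)               # z_{alpha/2} for a 95% CI
hits = 0
for _ in range(B):
    x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    w = np.arctanh(r)                         # Fisher's transformation
    half = z / np.sqrt(n - 3)                 # half-width on the w scale, Eq. (4)
    hits += np.tanh(w - half) <= rho <= np.tanh(w + half)   # back-transform, Eq. (5)
coverage = hits / B                           # MC estimate of the CI coverage
```

For bivariate normal data the estimated coverage should fall close to the nominal 0.95, as the C column of Table 5 confirms.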
Here we present and discuss only some of these results, specifically those concerning ordinal variables with the same number of categories (k = 2, 3, 4, 5, 6) and two different types of marginal distribution (uniform and asymmetrical). Figure 3 shows the boxplots of R related to these scenarios for both continuous and ordinal variables. Table 5 reports the MC coverages and average widths of the CIs for ρ for continuous and ordinal variables. We can first observe that for every ρ the coverage of the CI based on the Fisher transformation is always very near to the nominal one for continuous variables, confirming the validity of Fisher's approximation for bivariate normal variables.
Figure 3: Boxplots of the MC sampling distributions of the sample correlation coefficient R for Continuous (C), Uniform (U) and Asymmetrical (A) ordinal marginals, varying k (k = 2, ..., 6): (a) ρ = 0.2, (b) ρ = 0.4, (c) ρ = 0.6.
In the case of ordinal variables, the overall results are still quite good, since the coverage is near 88% in the worst case considered. Under our experimental conditions, the best results are obtained for uniform marginal distributions and ρ = 0.2, the worst for asymmetrical distributions and ρ = 0.6; the number of categories does not seem to have a clear effect on the coverage, while the interval length depends only on ρ: as ρ increases, the length decreases. These results can be roughly explained by recalling that symmetrically distributed variables tend to be “closer to the normal”, the case in which Fisher's approximation works, and this is the more true the larger the number of categories/values the variable can assume, even if for ordinal variables this effect cannot be fully appreciated given the small number of categories. For asymmetrical variables, the opposite argument applies. Summarizing, we can conclude that high values of ρ and an asymmetrical distribution worsen the performance of the interval estimator for ordinal variables.

Since this analysis stresses that asymmetrical distributions along with high values of ρ represent critical scenarios for the use of the normal approximation of Fisher's transformation when ordinal variables are analyzed, we focus on a specific experimental condition (asymmetrical distribution and ρ = 0.6), considering small sample sizes and checking if and how a decreasing n further affects the results. Some findings are reported in Table 6; they show no evidence of a role of n.
Uniform distribution, ρ = 0.2 Asymmetrical distribution, ρ = 0.2
C k = 2 3 4 5 6 k = 2 3 4 5 6
coverage 0.9506 0.9490 0.9495 0.9487 0.9515 0.9443 0.9424 0.9397 0.9275 0.9355 0.9348
ave. width 0.1682 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681
Uniform distribution, ρ = 0.4 Asymmetrical distribution, ρ = 0.4
C k = 2 3 4 5 6 k = 2 3 4 5 6
coverage 0.9505 0.9260 0.9413 0.9448 0.9429 0.9431 0.9230 0.9149 0.9017 0.9117 0.9181
ave. width 0.1474 0.1472 0.1473 0.1472 0.1473 0.1473 0.1472 0.1472 0.1473 0.1473 0.1473
Uniform distribution, ρ = 0.6 Asymmetrical distribution, ρ = 0.6
C k = 2 3 4 5 6 k = 2 3 4 5 6
coverage 0.9507 0.8839 0.9324 0.9359 0.9401 0.9364 0.8759 0.9042 0.8809 0.8824 0.9029
ave. width 0.1126 0.1124 0.1124 0.1124 0.1125 0.1125 0.1123 0.1125 0.1126 0.1125 0.1125
Table 5: MC coverage and average width of CI for different ρ in case of continuous (C) or k-category ordinal
variables.
Of course, we present these scenarios as examples of application of the procedure and do not mean to cover all the possible cases the researcher faces in real applications; nevertheless, the general advice we can derive from these simulations is to be careful when r is large and the sample marginal distributions are asymmetrical.
For these situations, if a coverage closer to the nominal level is required, a different method can be used. We suggest, for example, the following two alternatives for building a CI for ρ from a bivariate sample s of size n:
n,k C 2 3 4 5 6
200 0.9511 0.8767 0.9042 0.8782 0.8782 0.9057
100 0.9487 0.8647 0.9004 0.8745 0.8805 0.9050
75 0.9500 0.8636 0.9053 0.8771 0.8771 0.9039
50 0.9478 0.8697 0.9008 0.8790 0.8728 0.8990
40 0.9518 0.8743 0.8992 0.8725 0.8688 0.9021
25 0.9499 0.8614 0.8921 0.8680 0.8651 0.8983
Table 6: MC coverages for ρ = 0.6 in case of normal and asymmetrical distributions with k categories, varying n.
• select n units with replacement from s, obtaining the bootstrap sample s∗, and compute on it the correlation coefficient r∗; repeat this B times to build the bootstrap distribution {r∗_b}; then build a CI from the α/2 and 1−α/2 quantiles of this distribution;

• determine the two marginal distributions and r on the sample s. Consider the obtained sample distributions and r as “targets” and, using the algorithm proposed in Section 3, draw B samples under those sample experimental conditions, compute r∗ on each of them, and thus obtain its distribution {r∗_b}; then build a CI as above.
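The first alternative, a plain percentile bootstrap, might look as follows in Python; the data-generating step and the function name `boot_ci` are ours, for illustration only:

```python
import numpy as np

def boot_ci(x, y, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the correlation coefficient."""
    rng = np.random.default_rng(seed)
    n = len(x)
    r_star = np.empty(B)
    for b in range(B):
        i = rng.integers(0, n, n)             # resample unit indices with replacement
        r_star[b] = np.corrcoef(x[i], y[i])[0, 1]
    # CI from the alpha/2 and 1 - alpha/2 quantiles of {r*_b}
    return np.quantile(r_star, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(3)
x, y = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], 200).T
lo, hi = boot_ci(x, y)
```

For the second alternative, the resampling line would be replaced by a call to the generation algorithm of Section 3 with the sample marginals and r as targets.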
5.2. PCA vs NLPCA
When we need to construct a composite numerical indicator for a latent variable using data from a questionnaire, the most useful and proper technique is Principal Component Analysis, in its linear (LPCA or simply PCA) or non-linear (NLPCA) version. PCA relies on the hypothesis of a linear relationship among the observed variables; if this hypothesis fails for the data, NLPCA can be applied, allowing for a nonlinear transformation of the data through an “optimal” quantification of the categories. In both cases, the indicator is constructed as a linear combination of the observed variables, whose categories are quantified by a Likert scale (PCA) or by optimal quantifications (NLPCA). The outcomes of both PCA and NLPCA are:
• loadings β_j, one for each variable, representing the weight of the variable in building the composite numerical indicator;
• scores, one for each unit, representing the value of the latent phenomenon for that unit;
• the maximum eigenvalue λmax of the correlation matrix of the (transformed) variables, which can be expressed as the percentage of the total variance of the (transformed) variables that is reproduced;
• in addition, for NLPCA, the optimal quantifications of the categories of each variable.
In this context, researchers might want to answer questions such as: “How much do PCA and NLPCA results differ?” or “Is it worthwhile, and when, to use the more complex NLPCA instead of the more popular and easier PCA?”. The answers possibly depend on the number of variables, the number of categories, the variables' distributions, and the association structure among the variables. By using our proposal we can try to give some answers to these questions. In fact, many ordinal datasets can be constructed for each experimental scenario; in such a way it is possible to empirically investigate whether the results differ and which experimental conditions most influence these differences. Here we just mean to show how one can proceed to compare the results of PCA and NLPCA under different experimental conditions and to give some early findings.
With this goal in mind, we fix the experimental conditions, generate the desired ordinal matrix D according to the algorithm described in Section 3, obtaining the ordinal dataset (units × variables), and then perform both PCA and NLPCA on it. As is known, with PCA the category labels (1, 2, ...) of the Likert-scale variables are treated as numerical values and are transformed into standardized variables. With NLPCA the category labels are transformed by the procedure into optimal numerical values (quantifications), again producing variables with zero mean and unit variance.
We compare the results provided by the two analyses in three ways:

• difference in scores, computed as the Euclidean distance d between the PCA scores x_{iL} and the NLPCA scores x_{iNL}:

\[
d = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{iL}-x_{iNL})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{iL}^2 + x_{iNL}^2 - 2x_{iL}x_{iNL}\right)} = \sqrt{\frac{1}{n}(2n-2n\rho_{L,NL})} = \sqrt{2(1-\rho_{L,NL})} \qquad (6)
\]

since ∑_{i=1}^{n} x_{iL} = ∑_{i=1}^{n} x_{iNL} = 0 and var(x_{iL}) = var(x_{iNL}) = 1 because of the normalization constraints. The larger the difference between the two methods (the distance d), the more opportune the adoption of NLPCA. This distance is a decreasing function of the correlation ρ_{L,NL} between the two sets of scores;
• the ratio λ^{NL}_{max}/λ^{L}_{max} between the maximum eigenvalues extracted by NLPCA and by PCA from the same matrix D: this ratio can only take values not smaller than 1, and the greater its value, the larger the advantage of using NLPCA;

• a kind of “nonlinearity index”, built as follows:

\[
NLI = m - \sum_{i=1}^{m} \rho(q_{iL}, q_{iNL}) \qquad (7)
\]

where q_{iL} and q_{iNL} are the Likert and optimal quantifications of the i-th variable; NLI takes value 0 if and only if the quantifications are equidistant, and hence the existing relationship is linear, while it takes larger and larger values as the nonlinearity of the data increases. The greater NLI, the more advisable the adoption of NLPCA. A slightly different normalized index of nonlinearity is proposed in [4].
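Both quantities are easy to check numerically. The Python sketch below (ours) verifies the algebraic identity of Eq. (6) on arbitrary standardized score vectors (standardization uses the population variance, matching the 1/n in the formula), and evaluates Eq. (7) on toy quantifications; an unweighted Pearson correlation between quantification vectors is assumed for ρ(q_{iL}, q_{iNL}).

```python
import numpy as np

rng = np.random.default_rng(5)

# --- Eq. (6): d equals sqrt(2(1 - rho)) for standardized scores ---
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)      # arbitrary correlated "scores"
xL = (a - a.mean()) / a.std()                 # zero mean, unit (population) variance
xNL = (b - b.mean()) / b.std()
d_direct = np.sqrt(np.mean((xL - xNL) ** 2))
rho_L_NL = np.mean(xL * xNL)                  # correlation of standardized scores
d_formula = np.sqrt(2 * (1 - rho_L_NL))

# --- Eq. (7): NLI = m minus the sum of per-variable correlations ---
def nli(q_lin, q_opt):
    return len(q_lin) - sum(np.corrcoef(ql, qo)[0, 1]
                            for ql, qo in zip(q_lin, q_opt))

labels = [np.array([1.0, 2.0, 3.0, 4.0])] * 2   # equidistant Likert labels, m = 2
bent = [np.array([1.0, 1.2, 1.5, 4.0])] * 2     # hypothetical optimal quantifications
nli_linear = nli(labels, labels)                # equidistant case: index is 0
nli_bent = nli(labels, bent)                    # nonlinear case: index is positive
```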
We considered a 4-dimensional ordinal variable, whose components have the first four distributions of Table 4, dealing separately with the two cases (uniform and asymmetrical). In a first step, we assumed the same value of the correlation coefficient ρ for all pairs of the four variables, fixing it equal to 0.2, 0.4 and 0.6, thus obtaining three correlation matrices R(0.2), R(0.4) and R(0.6) and six scenarios. Then, we moved to six further scenarios, using the same marginal distributions but changing the correlation matrix R(ρ) into R′(ρ), having the same maximum eigenvalue as R(ρ) but with unequal pairwise correlations inside. The choice is obviously not unique; we considered the matrix

\[
R'(\rho) = \begin{pmatrix} 1 & a & b & \rho \\ a & 1 & a & b \\ b & a & 1 & a \\ \rho & b & a & 1 \end{pmatrix}
\]

with a = 2b.
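For the 4×4 equicorrelation matrix R(ρ), λmax = 1 + 3ρ; a value of b (with a = 2b) giving R′(ρ) the same maximum eigenvalue can then be found numerically, for instance by bisection. A Python sketch of ours, under these assumptions:

```python
import numpy as np

def lam_max(rho, b):
    """Maximum eigenvalue of R'(rho) under the constraint a = 2b."""
    a = 2 * b
    R = np.array([[1, a, b, rho],
                  [a, 1, a, b],
                  [b, a, 1, a],
                  [rho, b, a, 1]], dtype=float)
    return np.linalg.eigvalsh(R)[-1]          # eigvalsh returns ascending order

rho = 0.4
target = 1 + 3 * rho                          # lambda_max of the equicorrelation R(rho)
lo, hi = 0.0, 0.5                             # lam_max is increasing in b on this range
for _ in range(60):                           # plain bisection on b
    mid = (lo + hi) / 2
    if lam_max(rho, mid) < target:
        lo = mid
    else:
        hi = mid
b = (lo + hi) / 2                             # for rho = 0.4, b is close to 0.25
```

One should also check that the resulting R′ is positive definite, i.e. a valid correlation matrix.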
For each of the 12 scenarios, using our procedure, S = 2,000 Monte Carlo samples of size n = 500 were drawn. For each scenario and each sample s we calculated the distance d(s), as defined in (6), and then derived its empirical distribution function F_d. The functions F_d, one for each of the twelve scenarios, are plotted in Figure 4. The figure points out that the difference in scores, and hence in results, between PCA and NLPCA is greater for smaller values of ρ and, to a lesser extent, for asymmetrical distributions. In fact, a small value of ρ means that the data do not present a strong linear structure, and then NLPCA may be more appropriate than PCA, because it can catch a potential nonlinear structure. With regard to the second finding, it may be noted that if the marginal distributions are asymmetrical the scores are more sensitive to the differences in the category values, so once again NLPCA seems more suitable. The variability of the ρij's within the matrix R′ (for the same λmax) does not seem to affect the results, confirming an analogous finding in [16], obtained following a different approach. From this simple simulation study, the factor that most influences the difference in scores between PCA and NLPCA is clearly the magnitude of ρ.
The values of the ratio λ^{NL}_{max}/λ^{L}_{max} and of NLI were also computed on the matrices D under each scenario and for each Monte Carlo run. In Figure 5 the boxplots of the ratio for the matrices R(0.2), R(0.4) and R(0.6) are displayed. As one can see, reducing the value of ρ and passing from uniform to asymmetrical distributions increases the ratio and makes NLPCA more suitable.
Finally, in Figure 6, the boxplots of NLI under the same 12 scenarios are provided. The index increases when reducing ρ and passing from uniform to asymmetrical distributions, thus confirming the previous findings (NLPCA is preferable to PCA when the linear correlation coefficients are small and the distributions are asymmetrical).
Figure 4: Empirical distribution functions of the distance d between PCA and NLPCA scores under the different scenarios: (a) uniform marginal distributions and correlation matrix R with constant ρ inside; (b) uniform marginal distributions and correlation matrix R′ with unequal ρ inside; (c) asymmetrical marginal distributions and correlation matrix R with constant ρ inside; (d) asymmetrical marginal distributions and correlation matrix R′ with unequal ρ inside; each panel shows R(0.2), R(0.4) and R(0.6).
Figure 5: Boxplots of the ratio λ^{NL}_{max}/λ^{L}_{max} under the different scenarios: (a)-(c) uniform distributions with ρ = 0.2, 0.4, 0.6; (d)-(f) asymmetrical distributions with ρ = 0.2, 0.4, 0.6.
Figure 6: Boxplots of NLI under the different scenarios: (a)-(c) uniform distributions with ρ = 0.2, 0.4, 0.6; (d)-(f) asymmetrical distributions with ρ = 0.2, 0.4, 0.6.
6. Conclusions
A method for generating ordinal data matrices under fixed experimental conditions, such as marginal distributions and correlation matrix, has been proposed. This method can be seen as a technique that transforms multivariate normal variables into other random variables with a partially assigned distribution, and is specifically addressed to ordinal variables. It starts from a multivariate standard normal variable Z with a known correlation matrix; from Z a random sample is drawn and the sample values are transformed into ordinal ones according to the desired marginal distributions. The correlation matrix of Z is properly computed by an original algorithm in order to ensure the required correlation matrix for the ordinal variables.
The proposal is very flexible, since it allows the researcher to choose the desired number of categories and the marginal distribution for each ordinal variable, as well as the association structure among the variables themselves, by entering the desired correlation matrix and checking its feasibility. The procedure can be used to simulate samples from ordinal variables with a pre-specified structure and is suitable, for example, for checking the behavior or performance of estimators, tests and statistical methods of analysis, or for comparing different methods under varying experimental conditions.
To show its applicability, two simulation studies have been carried out, one addressed to an inferential problem, the other related to an exploratory analysis. The first aims at deriving the sampling distribution of the correlation coefficient in the ordinal case and trying out the effectiveness of the procedure usually adopted under some scenarios; in the second, a comparison between the results of PCA and NLPCA for building a composite indicator is proposed and some suggestions about the choice between them are given. Both applications show how the method works and how easy its application is. Moreover, since the method is implemented in R as a very user-friendly tool, it can also be helpful for both research and teaching purposes.
References
[1] Biswas, A. (2004) Generating correlated ordinal categorical random samples, Statistics & Probability Letters
70(1): 25-35
[2] Bollen, K.A., Barb, B.H. (1981) Pearson’s R and Coarsely Categorized Measures, American Sociological
Review 46(2): 232-239
[3] Cario, M. C., Nelson, B.L. (1997) Modeling and generating random vectors with arbitrary marginal dis-
tributions and correlation matrix. Technical report, Department of Industrial Engineering and Management
Sciences, Northwestern University, Evanston, Illinois
[4] Carpita, M., Manisera, M. (2011) On the nonlinearity of homogeneous ordinal variables. In S. Ingrassia, R. Rocci, M. Vichi (Eds.), New Perspectives in Statistical Modeling and Data Analysis. Springer Series “Studies in Classification, Data Analysis, and Knowledge Organization”, ISBN: 978-3-642-11362-8
[5] Charmpis, D.C., Panteli, P.L. (2004) A heuristic approach for the generation of multivariate random samples
with specified marginal distributions and correlation matrix, Computational Statistics 19: 283-300
[6] Demirtas, H. (2009) A method for multivariate ordinal data generation given marginal distributions and
correlations, Journal of Statistical Computation and Simulation 76(11): 1017-1025
[7] Fisher, R.A. (1921) On the “probable error” of a coefficient of correlation deduced from a small sample,
Metron 1: 3-32
[8] Ferrari, P.A., Annoni, P., Barbiero, A., Manzi, G. (2011) An imputation method for categorical variables
with application to nonlinear principal component analysis, Computational Statistics & Data Analysis 55(6):
2410-2420
[9] Fréchet, M. (1951) Sur les tableaux de corrélation dont les marges sont données, Ann. Univ. Lyon, Section A, Series 3(14): 53-77
[10] Guilford, J.P. (1965) Fundamental Statistics in Psychology and Education. New York: McGraw-Hill
[11] Iman, R.L., Conover, W.J. (1982) A Distribution-Free Approach to Inducing Rank Correlation Among Input
Variables, Communications in Statistics: Simulation Computation 11(3): 311-334.
[12] Higham, N. (2002) Computing the nearest correlation matrix - a problem from finance, IMA Journal of
Numerical Analysis 22: 329-343
[13] Johnson, M.E. (1987) Multivariate Statistical Simulation. New York: John Wiley
[14] Li, S.T., Hammond, J.L. (1975) Generation of pseudorandom numbers with specified univariate distributions
and correlation coefficients, IEEE Transactions on Systems, Man and Cybernetics 5: 557-561
[15] Lurie, P.M., Goldberg, M.S. (1998) An Approximate Method for Sampling Correlated Random Variables
from Partially-Specified Distributions, Management Science 44(2): 203-218
[16] van Rijckevorsel, J.L.A., Bettonvil, B., de Leeuw, J., (1985). Recovery and Stability in Non-linear Principal
Components Analysis. University of Leiden: Department of Data Theory.
[17] Ruscio, J., Kaczetow, W. (2008) Simulating Multivariate Nonnormal Data Using an Iterative Algorithm,
Multivariate Behavioral Research, 43(3): 355-381
[18] Schweizer, B., Wolff, E. F. (1981) On nonparametric measures of dependence for random variables, Annals
of Statistics 9: 879-885
[19] Stanhope, S. (2005) Case studies in multivariate-to-anything transforms for partially specified random vector
generation, Insurance: Mathematics and Economics 1(1): 68-79