DIPARTIMENTO DI SCIENZE ECONOMICHE AZIENDALI E STATISTICHE
Via Conservatorio 7 20122 Milano
tel. ++39 02 503 21501 (21522) - fax ++39 02 503 21450 (21505) http://www.economia.unimi.it
E Mail: [email protected]
GENERATING ORDINAL DATA
PIER ALDA FERRARI ALESSANDRO BARBIERO
Working Paper n. 2011-38
DECEMBER 2011
Generating ordinal data
Pier Alda Ferrari, Alessandro Barbiero
Department of Economics, Business and Statistics, Università di Milano, via Conservatorio 7, 20122 Milan (Italy)
Abstract
Due to the increasing use of ordinal variables in many fields, new statistical methods for their analysis have been
introduced, whose performance needs to be investigated under different experimental conditions. Proper procedures
for simulating ordinal variables are therefore required. The present paper deals with simulation from multivariate
ordinal random variables. A new proposal for generating samples from ordinal random variables with pre-specified
correlation matrix and marginal distributions is presented. Its features are examined and a comparison with its main
competitors is discussed. An implementation as an R package is provided. Examples of application are
also supplied.
1. Introduction
In recent years, increasing interest has been devoted to categorical ordinal data and to methods for their
statistical analysis. Most of these methods concern the reduction of the dimensionality of the original dataset, given
the huge number of units and/or variables usually involved (large datasets) and the degree of association present in
the variables. Consider questionnaires on service quality or risk propensity, or psychometric assessments,
where respondents answer many questions about different aspects of their satisfaction with, or attitude towards,
a service, product, or situation they have experienced. Typically, the possible answers are ordered, generally according to the
level of agreement with a statement, and are highly associated; a dataset of associated ordinal variables thus arises.
Statistical techniques for such data keep being developed, but their performance and properties need to be established. In exploratory analysis,
the robustness and performance of a technique can be assessed almost exclusively through simulation studies,
which require the generation of a huge number of datasets; and even in small-sample studies, the properties of
estimators must be studied by simulation when parameter estimation is based on asymptotic theory.
Different proposals have been presented for the construction of such datasets. These techniques
are based on transforming original variables with known distributions into final variables with the desired
features. They follow two approaches: either sample from original variables adjusted “a priori” in order to match
the requirements for the final variables, or adjust “a posteriori” the drawn sample to reach the desired features. Most
of them, however, address continuous random variables.
BePress November 14, 2011
Within the first approach, and starting from a multivariate standard normal distribution Z with a
specified correlation matrix R_Z, Lurie and Goldberg [15] presented an algorithm for generating a vector X of random
variables with given marginal distributions and a correlation matrix as “close” as possible to a target matrix R_X. This
proposal can only be applied to a set of final continuous variables with strictly increasing distribution functions.
Cario et al. [3] proposed NORTA (NORmal To Anything), a more general method which transforms, through
an iterative procedure, a multivariate standard normal random vector with correlation matrix R_Z into the desired
random vector with correlation matrix R_X and arbitrary marginal distributions. Stanhope [19] extended the core
idea of NORTA by proposing alternatives for the cases where NORTA fails, moving from the NORTA to the
MTA (Multivariate To Anything) procedure.
Charmpis and Panteli [5] proposed an interesting different two-step approach to solve the problem: first a
univariate random sample is independently drawn from each pre-specified marginal distribution, then a heuristic
optimization procedure sequentially rearranges the univariate samples, from the second to the last, in order to
match the desired correlation coefficient. Ruscio and Kaczetow [17] proposed a similar technique for simulating
multivariate non-normal samples using an iterative algorithm. This proposal, referred to as sample and iterate (SI)
technique, is able to generate a dataset that reproduces the distributions and correlations directly specified by the
user or observed in a sample of data provided. This technique can be applied for simulating both continuous and
discrete variables, but is also quite computationally intensive. Contrary to the first procedures, which transform
“a priori” the random variables and then sample from the multidimensional random variable so obtained, the latter
two algorithms transform the samples obtained, trying to match the target marginal distributions and correlations
in the sample.
With regard to the more specific case of final ordinal variables, not much has been proposed. Biswas [1]
suggested a method for generating ordinal categorical data with a particular correlation structure, but it requires
identically distributed ordinal random variables, so it cannot be applied when the marginal distributions differ, and
hence not even when the number of categories varies across variables. This is an obvious drawback, which
limits its generality and applicability. An interesting method was introduced by Demirtas [6] for generating
multivariate ordinal data with given marginal distributions and correlation matrix. His technique is based on
simulating binary data whose marginals are derived by collapsing the pre-specified marginals of the ordinal variables.
The binary correlation matrix is obtained by an iterative process, in order to match the target correlation matrix for the
ordinal data. Binary data are then converted to ordinal data through a randomization step. This method is very flexible:
it allows the creation of ordinal data matrices with a different number of categories and a different distribution for
each variable, and a specific correlation pattern. It presents some drawbacks, however: the procedure is involved and
computationally expensive; for example, in order to achieve the required correlations it introduces an iterative
procedure requiring the generation of large samples of binary data, which slows down the process.
This paper introduces a different and simple procedure to obtain multivariate ordinal variables with assigned
marginal distributions and correlation structure. Its starting point is the simulation of a sample from a multivariate
normal random variable, which is well documented in the literature (see for example [13]) and implemented in almost
all statistical software packages; the values of the sample are then “discretized” in order to obtain an ordinal dataset. The matrix
R_Z is adjusted to ensure the desired characteristics for the ordinal variables. The method allows the user to set “a
priori” the number of categories and the distribution of each final ordinal variable, as well as the correlation matrix
of the final multivariate ordinal variable, so that it is possible to assess the performance of a method under varying
experimental conditions. The procedure is implemented in R.
The paper is outlined as follows. In Section 2, the main features of the correlation coefficient for ordinal
variables and the relationship between the correlation coefficient of two continuous variables and that of the
corresponding discretized variables are described. In Section 3 a proposal for generating ordinal datasets according
to specified experimental conditions is presented, while its features and a comparison with the main competitors are
discussed in Section 4. In Section 5, the validity and applicability of the procedure are checked through two simulation
studies, addressed to analyzing sampling distributions and to comparing two alternative methods of dimensionality
reduction: PCA vs NLPCA. Finally, some conclusions are presented in Section 6.
2. Correlation for ordinal variables
Almost all the methods described above employ Pearson's ρ as a measure of association: this is due to
the far greater popularity enjoyed by this index compared to others available in the literature, and to the lack of
computational procedures for generating random vectors with an association structure specified in terms other than a
correlation matrix; a significant exception is given in [11]. The adoption of Pearson's ρ requires a numerical, or
at least a Likert, scale (1, 2, . . . , k) for all the variables. Henceforth, as we are dealing with ordinal variables, we will
assume a Likert scale and treat it as a numerical scale. This assumption is more general than it seems: since the
correlation coefficient is invariant under location and scale transformations, the value of ρ depends on
the joint distribution of the variables but does not change whatever values are assigned to the ordered categories,
provided that they are equidistant. In any case, what we propose works even with different, non-equidistant
numerical scores for the categories of the ordinal scale.
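As a quick illustration of this invariance (a Python sketch with simulated data; the variable names and the toy association mechanism are ours), Pearson's ρ is unchanged by any affine recoding of equidistant scores but generally changes under a non-equidistant recoding:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 5, size=1000)                      # ordinal codes 1..4
y = np.clip(x + rng.integers(-1, 2, size=1000), 1, 4)  # an associated ordinal variable

rho = np.corrcoef(x, y)[0, 1]
rho_affine = np.corrcoef(10 * x, 5 * y - 3)[0, 1]      # affine recoding: rho unchanged
scores = np.array([1, 2, 4, 8])                        # non-equidistant category scores
rho_uneq = np.corrcoef(scores[x - 1], scores[y - 1])[0, 1]  # generally differs
```

Here `rho` and `rho_affine` coincide to machine precision, while `rho_uneq` departs from them.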
Before presenting our proposal, two further aspects concerning the correlation coefficient ρ for ordinal variables
need to be discussed. The first regards the range of values that the correlation coefficient can take for discrete
variables; the second concerns the relationship between the correlation of two continuous variables and the
correlation of the two corresponding discretized ones.
As far as the former aspect is concerned, it is worth noting that when the study involves two discrete
variables (X_1, X_2) with assigned marginal distributions, it is not always possible to assign any value in the interval
[−1, +1] to the correlation coefficient.
In fact, if F_1(x_1) and F_2(x_2) are the (right-continuous) cumulative distribution functions (cdf) of X_1 and X_2 and
F(x_1, x_2) their joint cdf, it is known that:

(a) F_m(x_1, x_2) ≤ F(x_1, x_2) ≤ F_M(x_1, x_2), where F_m(x_1, x_2) = max{0, F_1(x_1) + F_2(x_2) − 1} and F_M(x_1, x_2) = min{F_1(x_1), F_2(x_2)}
(Fréchet, 1951); and

(b) the correlation coefficient between X_1 and X_2 is minimized at F(x_1, x_2) = F_m(x_1, x_2) and maximized at
F(x_1, x_2) = F_M(x_1, x_2) (Schweizer and Wolff, 1981), where F_m and F_M are the Fréchet minimal and maximal
distributions defined in (a).

For the specific case of ordinal variables, results (a) and (b) above keep holding and allow us to
calculate the minimum (ρ_m) and maximum (ρ_M) values of the correlation coefficient. In fact, given the two marginal
distributions F_1(x_1) and F_2(x_2), it is sufficient to determine the cdfs F_m(x_1, x_2) and F_M(x_1, x_2), adopt a Likert scale
for both X_1 and X_2, and then calculate ρ_m and ρ_M.
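The computation of (ρ_m, ρ_M) just described can be sketched as follows (a Python/numpy sketch; the function name and interface are ours): build the two Fréchet joint cdfs from the marginal cdfs, recover the joint pmfs by double differencing, and compute the two correlations on the Likert scores.

```python
import numpy as np

def frechet_rho_bounds(p1, p2):
    """Minimum and maximum Pearson correlation attainable by two ordinal
    variables with marginal pmfs p1, p2 on Likert scales 1..k1 and 1..k2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    F1, F2 = np.cumsum(p1), np.cumsum(p2)
    x1, x2 = np.arange(1, p1.size + 1), np.arange(1, p2.size + 1)

    def rho(F):
        # joint pmf from joint cdf by double finite differencing
        P = np.diff(np.diff(F, axis=0, prepend=0), axis=1, prepend=0)
        m1, m2 = x1 @ p1, x2 @ p2
        s1 = np.sqrt(((x1 - m1) ** 2) @ p1)
        s2 = np.sqrt(((x2 - m2) ** 2) @ p2)
        return (np.outer(x1 - m1, x2 - m2) * P).sum() / (s1 * s2)

    F_M = np.minimum.outer(F1, F2)                  # Frechet upper bound (a)
    F_m = np.maximum(np.add.outer(F1, F2) - 1, 0)   # Frechet lower bound (a)
    return rho(F_m), rho(F_M)
```

For two uniform binary marginals this returns (−1, 1); with unequal marginals the attainable range shrinks, consistent with the bounds check on the user-supplied target matrix described in Section 4.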
Coming to the second issue, the consequences of discretizing two continuous variables on
the value of their correlation coefficient were analysed by Guilford (1965) in connection with the discretization of a
standard bivariate normal distribution. He shows that if Z_1 and Z_2 are two variables with joint normal distribution
and correlation coefficient ρ_C, and X_1, X_2 are the two dichotomous variables obtained by collapsing Z_1 and Z_2 according
to whether each is smaller or greater than its mean, the correlation coefficient ρ_O between X_1 and X_2 is
ρ_O = (2/π) sin^{−1}(ρ_C). It is smaller than ρ_C and can be approximated by (2/3)ρ_C for ρ_C smaller than 0.6.
Apart from some further special cases (see for example [14]), it is not possible to write the relationship between
ρ_O and ρ_C in explicit terms, so Bollen and Barb [2] analyzed through a simulation study how the discretization
process influences the correlation for a bivariate normal distribution. They collapsed the values of each variable
into k ≥ 2 categories, taking intervals of the same width, and obtained smaller correlations: the lower the number
of categories, the less close the correlations of the collapsed variables to those of the original ones. Cario et al. [3]
noticed that when transforming a standardized multidimensional normal variable Z into a generic variable X, the
correlation matrices R_Z and R_X differ only slightly if the marginals X_i are continuous and relatively symmetrical,
while they differ significantly if the X_i are discrete (and not symmetrical): they considered the case of a trivariate
random variable with Binomial marginals (n = 3, p = 0.5).
Since in our proposal the relationship between ρ_C and ρ_O is a crucial point, we discuss this issue further,
determining the values of the ordinal correlation coefficient ρ_O between two uniformly distributed final ordinal
variables X_1 and X_2, obtained by discretizing two standard normal variables Z_1 and Z_2, for different values of their
correlation coefficient ρ_C and of the number k of ordinal categories (for simplicity, k is fixed equal for the two variables).
The correlation coefficient of the ordinal variables is determined by the following relationship:

ρ_O = [ Σ_{l=1}^{k} Σ_{t=1}^{k} (l − E(X_1)) (t − E(X_2)) f_{Z_1 Z_2}(l/k, t/k) ] / sqrt( Σ_{l=1}^{k} (l − E(X_1))^2 (1/k) · Σ_{t=1}^{k} (t − E(X_2))^2 (1/k) )    (1)
[Figure 1: Comparison between ρ_O and ρ_C (discretization with uniform marginals). Panel (a): values of ρ_O for the discretized variables as ρ_C varies, for k = 2, 5, 8; panel (b): values of the ratio ρ_O/ρ_C as k varies, for ρ_C = 0.8, 0.5, 0.2.]
where

f_{Z_1,Z_2}(l/k, t/k) = Φ_{Z_1,Z_2}(Φ_Z^{−1}(l/k), Φ_Z^{−1}(t/k)) − Φ_{Z_1,Z_2}(Φ_Z^{−1}(l/k), Φ_Z^{−1}((t−1)/k))
− Φ_{Z_1,Z_2}(Φ_Z^{−1}((l−1)/k), Φ_Z^{−1}(t/k)) + Φ_{Z_1,Z_2}(Φ_Z^{−1}((l−1)/k), Φ_Z^{−1}((t−1)/k)),

Φ_{Z_1,Z_2} being the bivariate standard normal distribution function, Φ_Z the univariate standard normal distribution
function, and Φ_Z^{−1} its inverse. ρ_O can be calculated with the R package mvtnorm, which computes the values
of the bivariate normal distribution function.
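Equation (1) can be evaluated numerically. The sketch below (Python, with scipy playing the role that the mvtnorm package plays in R; the function name and the ±10 stand-ins for infinite limits are ours) computes ρ_O for uniform marginals; for k = 2 it reproduces Guilford's result ρ_O = (2/π) sin^{−1}(ρ_C).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def rho_ordinal(rho_c, k):
    """rho_O of Eq. (1): correlation of two uniform ordinal variables obtained
    by discretizing a standard bivariate normal with correlation rho_c."""
    cuts = norm.ppf(np.arange(k + 1) / k)
    cuts[0], cuts[-1] = -10.0, 10.0          # numerical stand-ins for -inf, +inf
    mvn = multivariate_normal([0.0, 0.0], [[1.0, rho_c], [rho_c, 1.0]])
    # joint cdf on the grid of thresholds, then cell probabilities f(l/k, t/k)
    C = np.array([[mvn.cdf([a, b]) for b in cuts] for a in cuts])
    P = np.diff(np.diff(C, axis=0), axis=1)
    x = np.arange(1, k + 1)
    m1, m2 = x @ P.sum(axis=1), x @ P.sum(axis=0)
    s1 = np.sqrt(((x - m1) ** 2) @ P.sum(axis=1))
    s2 = np.sqrt(((x - m2) ** 2) @ P.sum(axis=0))
    return (np.outer(x - m1, x - m2) * P).sum() / (s1 * s2)
```

With ρ_C = 0.8 and k = 2 this gives a value close to .590, matching Table 1a; as k grows the value approaches ρ_C from below.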
ρ_C \ k    2      4      6      8      10
 .80     .590   .729   .759   .770   .776
 .60     .410   .531   .557   .567   .572
 .40     .262   .347   .366   .374   .377
 .20     .128   .172   .182   .186   .188
(a) Values of ρ_O as k varies, for different ρ_C

ρ_C \ k    2      4      6      8      10
 .80     .738   .911   .948   .963   .969
 .60     .683   .885   .929   .946   .954
 .40     .655   .868   .916   .934   .943
 .20     .641   .859   .909   .928   .938
(b) Values of the ratio ρ_O/ρ_C as k varies, for different ρ_C

Table 1: Comparison between ρ_C and ρ_O for different values of ρ_C and k (discretization with uniform marginals).
A synthesis of the results is reported in Table 1: panel (a) gives, for each value of ρ_C, the corresponding ρ_O as k
varies, while panel (b) reports the related ratios ρ_O/ρ_C. We note that ρ_C and ρ_O are close when ρ_C is high
and/or the number of categories k is large. To better detect the role of the marginal distributions, the values of ρ_O
[Figure 2: Comparison between ρ_O and ρ_C (discretization with equal-width intervals). Panel (a): values of ρ_O for the discretized variables for different ρ_C and k = 2, 5, 8; panel (b): values of the ratio ρ_O/ρ_C for different k and ρ_C = 0.8, 0.5, 0.2.]
and of the ratio ρ_O/ρ_C are also obtained when the discretization is performed by dividing the interval (−3, 3) of Z into
intervals of the same width (so that a symmetrical but non-uniform distribution is obtained). These results (see Tables
2a and 2b) are only slightly different from the previous ones.
ρ_C \ k    2      4      6      8      10
 .80     .590   .673   .736   .763   .776
 .60     .410   .502   .552   .572   .582
 .40     .262   .334   .368   .381   .388
 .20     .128   .167   .184   .191   .194
(a) Values of ρ_O for different ρ_C and k

ρ_C \ k    2      4      6      8      10
 .80     .738   .841   .920   .954   .970
 .60     .683   .837   .920   .953   .969
 .40     .655   .836   .920   .953   .969
 .20     .641   .836   .920   .953   .969
(b) Values of ρ_O/ρ_C for different ρ_C and k

Table 2: Comparison between ρ_C and ρ_O for different ρ_C and k (discretization with intervals of the same width).
To make the comparison between ρ_C and ρ_O as k varies more evident, the results are displayed in Figures 1
and 2, which refer to the two discretization procedures: equal probabilities and equal-width intervals, respectively.
From the analysis of Tables 1 and 2 and Figures 1 and 2, we can state that the correlation coefficient of two
continuous normal variables is always larger than that of the corresponding discretized variables, but the two get
closer as the number of categories of the discrete ordinal variables and/or the value of ρ_C itself increases. This
point needs to be taken into account when generating discrete data from continuous variables or when representing
continuous variables by their discretized version. Incidentally, the method we are going to introduce allows us to
know the magnitude of the change in ρ in passing from continuous to ordinal variables and vice versa, under
different experimental conditions, and consequently allows us to take it into account in carrying out the analysis.
3. The new proposal
The method we propose is based on the transformation and adjustment of the original variables to produce
ordinal variables with the required characteristics. It is developed in two steps. The first step sets up the original
continuous variables so as to achieve ordinal variables meeting the experimental conditions; the second
generates samples for simulation from the original variables so adjusted.
3.1. Setting up of the continuous variables
First of all, once the experimental conditions are fixed, we need a procedure that guarantees them. The procedure
we develop is based on the following scheme.
We take a r.v. Z ∼ N(0, R_C), with R_C known (in the first stage R_C = R_{O*}, where R_{O*} is the target correlation
matrix among the ordinal variables), and transform it by employing a quantile approach; we obtain an ordinal r.v. X
with the assigned marginal distributions. We compute R_O on X, compare it with R_{O*}, and adjust R_C through an
“ad hoc” iterative procedure until R_O converges to R_{O*}.
In more detail, we consider an m-dimensional variable Z following a standard normal distribution with correlation
matrix R_C, i.e. Z ∼ N(0, R_C = R_{O*}) at the first step. We transform the original variable Z into a variable
X with categorical (ordinal) components, proceeding in the following way (see also [8]). For the i-th component,
the k_i − 1 probabilities 0 < F_{i1} < F_{i2} < · · · < F_{il} < · · · < F_{i(k_i−1)} < 1 of the marginal distribution of X_i define the
corresponding normal quantiles r_{i1} < r_{i2} < · · · < r_{il} < · · · < r_{i,k_i−1}. The values of the variable Z_i are then converted into
ordinal categories, labeled with the integers X_i, in the following way:

if Z_i < r_{i1} → X_i = 1
if r_{i1} ≤ Z_i < r_{i2} → X_i = 2
. . .
if r_{i(k_i−1)} ≤ Z_i → X_i = k_i.    (2)

An m-dimensional ordinal variable X = (X_1, X_2, . . . , X_m) is thus settled. The single components X_i of X can have
a different number of categories and different marginal probabilities, according to the number k_i and the values of F_{il}
chosen.
This procedure ensures the desired marginal distribution F_i for each component X_i, but the correlation matrix
R_O of the ordinal vector X may differ considerably from the chosen matrix R_C = R_{O*}, because the discretization
process alters the correlation coefficients (see Section 2). To overcome this problem, it is necessary to
determine a continuous correlation matrix R_{C*} able to ensure, for the transformed m-dimensional ordinal variable,
the target ordinal correlation matrix R_{O*}. This is sorted out by the following algorithm:
1. Set the correlation matrix for the continuous variables R_C(0) = R_{O*} and discretize Z according to the
procedure described above, thus obtaining an ordinal random variable X(1). This step can be performed in R
through the function pmvnorm (package mvtnorm), which computes the distribution function of the multivariate
normal distribution for arbitrary limits and correlation matrix;
2. Compute R_O(1) for X(1);
3. Loop: while max_{i≠j} |ρ_{ij}^{O(k)} − ρ_{ij}^{O*}| > ε and k ≤ k_max (k_max and ε > 0 are discussed below):
   (a) update each element of the correlation matrix of the multivariate normal distribution, imposing

       ρ_{ij}^{C(k)} = ρ_{ij}^{C(k−1)} f_{ij}(k),   i = 1, 2, . . . , m, j > i,    (3)

       where f_{ij}(k) = ρ_{ij}^{O*} / ρ_{ij}^{O(k)} for all i ≠ j. The ratio f_{ij}(k) can thus be regarded as a “correction
       coefficient” at step k for moving from the continuous to the ordinal variable with the fixed marginal
       distributions and the discretization procedure described above; the update expressed by (3) is adapted
       when peculiar situations require it;
   (b) if R_C(k) is no longer a correlation matrix, find the “nearest positive definite matrix” (see [12] for the
       implementation in R);
   (c) generate the ordinal random variable X(k+1) according to the discretization algorithm described above,
       using R_C(k);
   (d) compute R_O(k+1);
4. The final continuous correlation matrix R_{C*} is used to generate an n × m sample matrix of ordinal data coming
from variables with the assigned marginal distributions and ordinal correlation matrix R_{O*}.

ε is the maximum admissible error on the off-diagonal elements of the actual correlation matrix of the ordinal r.v.
(a plausible value is 0.001). To ensure termination of the algorithm if ε is set too small, a maximum number
of iterations k_max can be fixed (e.g. k_max = 100).
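As an illustration, the loop above can be sketched in Python for the bivariate case (our own Monte Carlo variant: the exact recomputation of R_O via pmvnorm is replaced by the sample correlation of a large generated sample; all names and defaults are ours):

```python
import numpy as np
from statistics import NormalDist

def ordinal_sample(rho_target, probs, n=200_000, eps=0.005, kmax=50, seed=1):
    """Bivariate sketch of the Section 3 procedure: iterate the correction (3)
    on rho_C until the ordinal correlation matches rho_target within eps."""
    inv = NormalDist().inv_cdf
    # thresholds r_il: normal quantiles of the cumulative marginal probabilities
    cuts = [np.array([inv(F) for F in np.cumsum(p)[:-1]]) for p in probs]
    rng = np.random.default_rng(seed)
    rho_c = rho_target                        # start from R_C(0) = R_O*
    for _ in range(kmax):
        z = rng.multivariate_normal([0.0, 0.0],
                                    [[1.0, rho_c], [rho_c, 1.0]], size=n)
        # rule (2): X_i = 1 + number of thresholds not exceeding Z_i
        x = np.column_stack([np.searchsorted(c, z[:, i]) + 1
                             for i, c in enumerate(cuts)])
        rho_o = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
        if abs(rho_o - rho_target) <= eps:    # convergence criterion
            break
        # correction coefficient f(k) = rho_O* / rho_O(k), Eq. (3)
        rho_c = float(np.clip(rho_c * rho_target / rho_o, -0.999, 0.999))
    return x, rho_c
```

In our runs with uniform marginals, the loop stops after a handful of iterations, with the final ρ_C above the target ρ_O, reflecting the dilation effect discussed in Section 2.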
3.2. Generation of samples
For every experimental condition, i.e. for every set of marginal distributions and correlation matrix R_{O*}, the
related R_{C*} is determined; the dataset is then obtained by drawing from Z ∼ N(0, R_{C*}) a random sample of size
n and transforming the values according to the discretization process described in Subsection 3.1. For simulation
studies this drawing is iterated the desired number of times under the same experimental conditions. At the end of
the process, the experimental conditions are changed according to the experimental design.
This method is implemented in R and the code is available from the authors upon request.
4. Features and comparison with main competitors
The algorithm described in the previous section assures that the sample is drawn from an ordinal multidimensional
distribution satisfying the required experimental conditions (marginal distributions and correlation matrix).
In particular, if ρ_{ij}^{O*} is the (i, j)-th element of the matrix R_{O*} and ρ_{ij}^{C*} the corresponding element of the
matrix R_{C*}, the procedure is able to determine the m(m−1)/2 values d_{ij} (dilation coefficients) which allow us to
move from ρ_{ij}^{O*} to ρ_{ij}^{C*} for all i ≠ j. More generally, it provides

ρ_{ij}^{C*} = d_{ij}(ρ_{ij}^{O}),   ρ_{ij,m}^{O} < ρ_{ij}^{O} < ρ_{ij,M}^{O},

with d_{ij} depending on the marginal distributions F_i and F_j of the components X_i and X_j. This means that, once the
marginals are assigned, the proposed procedure is able to find numerically the value of ρ_C leading to ρ_O, for each
value of ρ_O. In other words, it allows ρ_C to be reconstructed as a function of ρ_O.
To give an idea of the relationship between ρ_C and ρ_O, Table 3a reports the values of ρ_C corresponding to some
fixed values of ρ_O (on the rows) and to different numbers of categories of ordinal variables with uniform distributions.
The table is displayed similarly to Table 1a, but the ρ_C are now determined by the algorithm described above. In
Table 3b, as in Table 1b but with the roles of ρ_C and ρ_O inverted, the “dilation coefficients” ρ_C/ρ_O are reported.
It is worth noting that if the values of ρ_C in Table 3a are used as correlation coefficients of a two-dimensional normal
variable, then the discretized uniform ordinal variables provided by (2) yield exactly the values of ρ_O that produced
those ρ_C, as can easily be verified.
The procedure has been illustrated with regard to a Likert scale, but it can be used, and maintains its characteristics,
also in the case of different, non-equidistant values for the categories.
The algorithm presents some specific features and advantages compared with the similar procedures for ordinal
variables described in Section 1. With respect to them, our proposal enjoys some valuable characteristics:

• it is an easy and computationally efficient tool for obtaining a multivariate discrete variable (in particular, an ordinal
one): the only restriction, due to the discretization procedure, is that each marginal distribution must have finite
support. Most available procedures do not give explicit practical details for discrete cases (Cario et al., Charmpis and
Panteli) or confine themselves to a limited context (Biswas);

• it is able to state the exact correspondence between R_O and R_C (by an iterative procedure) and between R_C
and R_O (by a direct calculation). In this way the differences between the correlation matrices yielded by
different samples of ordinal data drawn under the same experimental conditions are due only to sampling
errors. On the contrary, Demirtas' procedure requires the generation of an intermediate sample to state
the ordinal correlations, introducing an additional sampling error, while, on the opposite side, Ruscio and
Kaczetow, by adjusting the sample in order to reach the target correlation matrix, reduce the
ρ_O \ k    2      3      4      5      6      8      10
 .80     .951   .901   .868   .850   .839   .829   .823
 .60     .809   .710   .672   .654   .643   .633   .628
 .40     .588   .490   .459   .444   .436   .427   .424
 .20     .309   .250   .233   .224   .220   .216   .213
(a) Values of ρ_C

ρ_O \ k    2      3      4      5      6      8      10
 .80    1.189  1.127  1.084  1.062  1.049  1.036  1.029
 .60    1.348  1.184  1.120  1.089  1.072  1.055  1.046
 .40    1.469  1.226  1.146  1.110  1.089  1.069  1.059
 .20    1.545  1.252  1.163  1.122  1.100  1.077  1.067
(b) Values of ρ_C/ρ_O

Table 3: Values of ρ_C and of the “dilation coefficient” ρ_C/ρ_O computed by the algorithm for different values of ρ_O and k.
sampling error or need to yield a large simulated population with the desired features from which to draw the
sample. NORTA presents instead some practical limits: a desired pairwise correlation ρ_{ij}^{O} between the i-th
and j-th variables might not be feasible, i.e. it may not be possible to find a ρ_{ij}^{C} producing the desired
ρ_{ij}^{O}; or the desired correlation values ρ_{ij}^{O} may be feasible, but the matrix of the ρ_{ij}^{C} producing the
corresponding ρ_{ij}^{O} might not be positive definite and, hence, not a correlation matrix;

• it samples from a continuous random variable and then discretizes it, using an intuitive and straightforward
method which employs the quantiles of the normal distribution; on the contrary, for example, Demirtas'
algorithm, which samples from binary variables, introduces an arbitrary choice with regard to the generating
algorithm and involves an arbitrary splitting procedure for collapsing the ordinal categories into binary ones;

• it cannot get stuck, since 1) the correlation matrix for ordinal variables entered by the user is checked (positive
definiteness and lower/upper bounds, see below), 2) it includes the nearest-positive-definite algorithm,
which adjusts the updated correlation matrix, and 3) it provides for a maximum number of iterations, after
which a correlation matrix R_{C*} is in any case obtained. This does not always hold for the other procedures
considered;
• the computation time is short: we observed that, for a 4-dimensional variable X whose components have
different numbers of categories (2, 3, 4 and 5), setting up the correlation matrix R_{C*}, which includes the
check for the feasibility of R_{O*}, takes on average (i.e. over different R_{O*} and marginal distributions) about 1
second; drawing a sample from X takes about 0.01 seconds, for a sample size n varying from 100 to 1000.
Last but not least, the procedure is easy to understand and use, even for non-experts. The R code developed allows
the user to choose the main “parameters” of the ordinal data matrix: n, m, k_i and F_i for i = 1, . . . , m; the association
level among the m ordinal variables is directly provided by the user through the target correlation matrix R_{O*}.
Moreover, the coherence of the user's requirements is checked. Particular attention has been devoted to the chosen
correlation matrix: the values ρ_{ij}^{O} are verified to be coherent with the marginal distributions of the i-th and j-th
variables, and the entered matrix is verified to be positive definite. If needed, the algorithm helps
the users by providing lower and upper bounds for each ρ_{ij}^{O} given the marginal distributions, and warns them if the
entered matrix R_O is not a correlation matrix. For these reasons the procedure seems helpful for both research and
teaching purposes.
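The positive-definiteness repair invoked in step 3(b) can be approximated very simply; the sketch below is a crude eigenvalue-clipping stand-in for the nearest-correlation-matrix algorithm cited as [12], not the authors' implementation.

```python
import numpy as np

def near_corr(A, eps=1e-8):
    """Clip negative eigenvalues and rescale to unit diagonal: a simple
    substitute for the 'nearest positive definite matrix' adjustment."""
    B = (A + A.T) / 2                       # symmetrize
    w, V = np.linalg.eigh(B)
    B = V @ np.diag(np.clip(w, eps, None)) @ V.T
    d = np.sqrt(np.diag(B))
    return B / np.outer(d, d)               # restore unit diagonal
```

An invalid "correlation" matrix such as [[1, .9, −.9], [.9, 1, .9], [−.9, .9, 1]], which has a negative eigenvalue, is mapped to a genuine correlation matrix close to it.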
5. Examples of application
In this Section we propose two examples of possible applications of the procedure. The aim of these applications
is not an in-depth analysis of the problems discussed, but just to show how our procedure can easily and quickly
lead to results. The first refers to the construction of the sampling distribution of the estimator of the correlation
coefficient for ordinal data; the second concerns the comparison of two different methods, PCA and NLPCA, for
the setting up of a composite indicator based on observed ordinal data.
5.1. Correlation coefficient estimation
In inferential problems, it is usually required to determine the sampling distribution of the sample correlation
coefficient R, for testing hypotheses or constructing confidence intervals for ρ. If the observations come from a
bivariate normal distribution, an approximate distribution of the sample correlation coefficient can be adopted.
Considering the Fisher transformation

W = arctanh(R) = (1/2) ln((1 + R)/(1 − R)),

see [7], and setting ω = (1/2) ln((1 + ρ)/(1 − ρ)), it can be proved that the random variable (W − ω)/sqrt(1/(n − 3))
converges to a standard normal distribution. It is then straightforward, recalling the formulas above, to build an
approximate CI for ω at level (1 − α), given by

(w_L, w_U) = ( w − z_{α/2} sqrt(1/(n − 3)), w + z_{α/2} sqrt(1/(n − 3)) )    (4)

and, by applying the monotonic inverse transformation, a CI for ρ, given by

(r_L, r_U) = (tanh(w_L), tanh(w_U)) = ( (e^{w_L} − e^{−w_L})/(e^{w_L} + e^{−w_L}), (e^{w_U} − e^{−w_U})/(e^{w_U} + e^{−w_U}) )
= ( (e^{2w_L} − 1)/(e^{2w_L} + 1), (e^{2w_U} − 1)/(e^{2w_U} + 1) ).    (5)
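In code, Eqs. (4) and (5) amount to a few lines (a Python sketch; the function name is ours, and z_{α/2} is taken from the standard normal):

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Approximate (1 - alpha) CI for rho via the Fisher transformation:
    Eq. (4) on the w scale, mapped back to the r scale by tanh, Eq. (5)."""
    w = atanh(r)                                   # w = arctanh(r)
    half = NormalDist().inv_cdf(1 - alpha / 2) * sqrt(1 / (n - 3))
    return tanh(w - half), tanh(w + half)
```

For example, fisher_ci(0.5, 500) returns roughly (0.43, 0.56): the interval is symmetric on the w scale but not on the r scale, because tanh is nonlinear.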
(a) Uniform
k   p
2   0.5   0.5
3   1/3   1/3   1/3
4   0.25  0.25  0.25  0.25
5   0.2   0.2   0.2   0.2   0.2
6   1/6   1/6   1/6   1/6   1/6   1/6

(b) Asymmetrical
k   p
2   0.4   0.6
3   0.1   0.3   0.6
4   0.1   0.1   0.2   0.6
5   0.1   0.1   0.1   0.1   0.6
6   0.1   0.1   0.1   0.1   0.1   0.5

Table 4: Number k of categories and probabilities p assumed for the marginal distributions.
Of course the problem is more complex when ρ concerns non-normal variables, especially discrete/ordinal
variables. For inferential problems, if we refer to the CI (5), there is no evidence that it remains valid also in
these cases. Our procedure offers a way of finding the sampling distribution of R for discrete/ordinal variables under
different experimental conditions through Monte Carlo simulations. In fact, as seen in the previous sections, the
discretization of a bivariate normal variable affects the resulting ordinal correlation coefficient ρ_O through the
number of categories, the marginal distributions and the original value of ρ_C. It is interesting to investigate
empirically how these factors can affect the sampling distribution of R, focusing in particular on the actual coverage
of the CI given by (5).
For this purpose, we consider a pair of ordinal variables, whose marginal distributions are chosen from those reported in Table 4, where k is the number of categories, and three pairwise correlation coefficients, ρ ∈ {0.2, 0.4, 0.6}, coherent with the marginal distributions. By combining the possible marginal distributions and the values of ρ, we obtain many scenarios. Under each of them, following our procedure, we generate an ordinal matrix of dimension n = 500, m = 2, compute the sample correlation coefficient r, and then construct the 95% CI for ρ by (5). In addition, for each ρ, we also draw from the bivariate normal distribution a sample of the same size n = 500 and determine by (5) an analogous CI for ρ. We iterate these steps 10,000 times; at the end of the simulation plan, we compute the Monte Carlo distributions of the sample correlation coefficient R, the coverages of these CIs and their average widths. Note that the average length of the intervals depends on ρ and n (the latter fixed here), but not on the marginal distributions, as Equations (4) and (5) show, and is the same for ordinal and continuous variables.
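The bivariate-normal arm of this plan can be sketched as follows (a Python sketch of ours, with far fewer replications than the paper's 10,000; the seed and B = 400 are arbitrary choices):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
rho, n, B = 0.6, 500, 400                     # one scenario; fewer replications than the paper
z = NormalDist().inv_cdf(0.975)               # z_{alpha/2} for a 95% CI
hits = 0
for _ in range(B):
    x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    w = np.arctanh(r)                         # Fisher's transformation
    half = z / np.sqrt(n - 3)                 # half-width on the w scale, Eq. (4)
    hits += np.tanh(w - half) <= rho <= np.tanh(w + half)   # back-transform, Eq. (5)
coverage = hits / B                           # MC estimate of the CI coverage
```

For bivariate normal data the estimated coverage should fall close to the nominal 0.95, as the C column of Table 5 confirms.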
Here we present and discuss only some of these results, specifically those concerning ordinal variables with the same number of categories (k = 2, 3, 4, 5, 6) and two different types of marginal distribution (uniform and asymmetrical). Figure 3 shows the boxplots of R related to these scenarios for both continuous and ordinal variables. Table 5 reports the MC coverages and average widths of the CIs for ρ for continuous and ordinal variables. We can first observe that for every ρ the coverage of the CI based on the Fisher transformation is always very near to the nominal one for continuous variables, confirming the validity of Fisher's approximation for bivariate normal variables.
Figure 3: Boxplots of the MC sampling distributions of the sample correlation coefficient R for Continuous (C), Uniform (U) and Asymmetrical (A) ordinal marginals, varying k (k = 2, ..., 6): (a) ρ = 0.2, (b) ρ = 0.4, (c) ρ = 0.6.
In the case of ordinal variables, the overall results are still quite good, since the coverage is near 88% in the worst case considered. Under our experimental conditions, the best results are obtained for uniform marginal distributions and ρ = 0.2, the worst for asymmetrical distributions and ρ = 0.6; the number of categories does not seem to have a clear effect on the coverage, while the interval length depends only on ρ: as ρ increases, the length decreases. These results can be roughly explained by recalling that symmetrically distributed variables tend to be “closer to the normal”, the case in which Fisher's approximation works, and this is the more true the larger the number of categories/values the variable can assume, even if for ordinal variables this effect cannot be fully appreciated given the small number of categories. For asymmetrical variables, the opposite argument applies. Summarizing, we can conclude that high values of ρ and an asymmetrical distribution worsen the performance of the interval estimator for ordinal variables.

Since this analysis stresses that asymmetrical distributions along with high values of ρ represent critical scenarios for the use of the normal approximation of Fisher's transformation when ordinal variables are analyzed, we focus on a specific experimental condition (asymmetrical distribution and ρ = 0.6), considering small sample sizes and checking if and how a decreasing n further affects the results. Some findings are reported in Table 6; they show no evidence of a role of n.
Uniform distribution, ρ = 0.2 Asymmetrical distribution, ρ = 0.2
C k = 2 3 4 5 6 k = 2 3 4 5 6
coverage 0.9506 0.9490 0.9495 0.9487 0.9515 0.9443 0.9424 0.9397 0.9275 0.9355 0.9348
ave. width 0.1682 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681 0.1681
Uniform distribution, ρ = 0.4 Asymmetrical distribution, ρ = 0.4
C k = 2 3 4 5 6 k = 2 3 4 5 6
coverage 0.9505 0.9260 0.9413 0.9448 0.9429 0.9431 0.9230 0.9149 0.9017 0.9117 0.9181
ave. width 0.1474 0.1472 0.1473 0.1472 0.1473 0.1473 0.1472 0.1472 0.1473 0.1473 0.1473
Uniform distribution, ρ = 0.6 Asymmetrical distribution, ρ = 0.6
C k = 2 3 4 5 6 k = 2 3 4 5 6
coverage 0.9507 0.8839 0.9324 0.9359 0.9401 0.9364 0.8759 0.9042 0.8809 0.8824 0.9029
ave. width 0.1126 0.1124 0.1124 0.1124 0.1125 0.1125 0.1123 0.1125 0.1126 0.1125 0.1125
Table 5: MC coverage and average width of CI for different ρ in case of continuous (C) or k-category ordinal
variables.
Of course, we present these scenarios as examples of application of the procedure and do not mean to cover all the possible cases the researcher faces in real applications; nevertheless, the general advice we can derive from these simulations is to be careful when r is large and the sample marginal distributions are asymmetrical.
For these situations, if a coverage closer to the nominal level is required, a different method can be used. We suggest, for example, the following two alternatives for building a CI for ρ from a bivariate sample s of size n:
n,k C 2 3 4 5 6
200 0.9511 0.8767 0.9042 0.8782 0.8782 0.9057
100 0.9487 0.8647 0.9004 0.8745 0.8805 0.9050
75 0.9500 0.8636 0.9053 0.8771 0.8771 0.9039
50 0.9478 0.8697 0.9008 0.8790 0.8728 0.8990
40 0.9518 0.8743 0.8992 0.8725 0.8688 0.9021
25 0.9499 0.8614 0.8921 0.8680 0.8651 0.8983
Table 6: MC coverages for ρ = 0.6 in case of normal and asymmetrical distributions with k categories, varying n.
• select n units with replacement from s, obtaining the bootstrap sample s∗, and compute on it the correlation coefficient r∗; repeat this B times to build the bootstrap distribution {r∗_b}; then build a CI from the α/2 and 1−α/2 quantiles of this distribution;

• determine the two marginal distributions and r on the sample s. Consider the obtained sample distributions and r as “targets” and, using the algorithm proposed in Section 3, draw B samples under those sample experimental conditions, compute r∗ on each of them, and thus obtain its distribution {r∗_b}; then build a CI as above.
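The first alternative, a plain percentile bootstrap, might look as follows in Python; the data-generating step and the function name `boot_ci` are ours, for illustration only:

```python
import numpy as np

def boot_ci(x, y, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the correlation coefficient."""
    rng = np.random.default_rng(seed)
    n = len(x)
    r_star = np.empty(B)
    for b in range(B):
        i = rng.integers(0, n, n)             # resample unit indices with replacement
        r_star[b] = np.corrcoef(x[i], y[i])[0, 1]
    # CI from the alpha/2 and 1 - alpha/2 quantiles of {r*_b}
    return np.quantile(r_star, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(3)
x, y = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], 200).T
lo, hi = boot_ci(x, y)
```

For the second alternative, the resampling line would be replaced by a call to the generation algorithm of Section 3 with the sample marginals and r as targets.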
5.2. PCA vs NLPCA
When we need to construct a composite numerical indicator for a latent variable using data from a questionnaire, the most useful and proper technique is Principal Component Analysis, in its linear (LPCA or simply PCA) or non-linear (NLPCA) version. PCA relies on the hypothesis of a linear relationship among the observed variables; if this hypothesis fails for the data, NLPCA can be applied, allowing for a nonlinear transformation of the data through an “optimal” quantification of the categories. In both cases, the indicator is constructed as a linear combination of the observed variables, whose categories are quantified by a Likert scale (PCA) or by optimal quantifications (NLPCA). The outcomes of both PCA and NLPCA are:
• loadings β_j, one for each variable, representing the weight of the variable in building the composite numerical indicator;
• scores, one for each unit, representing the value of the latent phenomenon for that unit;
• the maximum eigenvalue λmax of the correlation matrix of the (transformed) variables, which can be expressed as the percentage of the total variance of the (transformed) variables that is reproduced;
• in addition, for NLPCA, the optimal quantifications of the categories of each variable.
In this context, researchers might want to answer questions such as: “How much do PCA and NLPCA results differ?” or “Is it worthwhile, and when, to use the more complex NLPCA instead of the more popular and easier PCA?”. The answers possibly depend on the number of variables, the number of categories, the variables' distributions, and the association structure among the variables. By using our proposal we can try to give some answers to these questions. In fact, many ordinal datasets can be constructed for each experimental scenario; in such a way it is possible to empirically investigate whether the results differ and which experimental conditions most influence these differences. Here we just mean to show how one can proceed to compare the results of PCA and NLPCA under different experimental conditions and to give some early findings.
With this goal in mind, we fix the experimental conditions, generate the desired ordinal matrix D according to the algorithm described in Section 3, obtaining the ordinal dataset (units × variables), and then perform both PCA and NLPCA on it. As is known, with PCA the category labels (1, 2, ...) of the Likert-scale variables are treated as numerical values and are transformed into standardized variables. With NLPCA the category labels are transformed by the procedure into optimal numerical values (quantifications), again producing variables with zero mean and unit variance.
We compare the results provided by the two analyses in three ways:

• difference in scores, computed as the Euclidean distance d between the PCA scores x_{iL} and the NLPCA scores x_{iNL}:

\[
d = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{iL}-x_{iNL})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{iL}^2 + x_{iNL}^2 - 2x_{iL}x_{iNL}\right)} = \sqrt{\frac{1}{n}(2n-2n\rho_{L,NL})} = \sqrt{2(1-\rho_{L,NL})} \qquad (6)
\]

since ∑_{i=1}^{n} x_{iL} = ∑_{i=1}^{n} x_{iNL} = 0 and var(x_{iL}) = var(x_{iNL}) = 1 because of the normalization constraints. The larger the difference between the two methods (the distance d), the more opportune the adoption of NLPCA. This distance is a decreasing function of the correlation ρ_{L,NL} between the two sets of scores;
• the ratio λ^{NL}_{max}/λ^{L}_{max} between the maximum eigenvalues extracted by NLPCA and by PCA from the same matrix D: this ratio can only take values not smaller than 1, and the greater its value, the larger the advantage of using NLPCA;

• a kind of “nonlinearity index”, built as follows:

\[
NLI = m - \sum_{i=1}^{m} \rho(q_{iL}, q_{iNL}) \qquad (7)
\]

where q_{iL} and q_{iNL} are the Likert and optimal quantifications of the i-th variable; NLI takes value 0 if and only if the quantifications are equidistant, and hence the existing relationship is linear, while it takes larger and larger values as the nonlinearity of the data increases. The greater NLI, the more advisable the adoption of NLPCA. A slightly different normalized index of nonlinearity is proposed in [4].
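Both quantities are easy to check numerically. The Python sketch below (ours) verifies the algebraic identity of Eq. (6) on arbitrary standardized score vectors (standardization uses the population variance, matching the 1/n in the formula), and evaluates Eq. (7) on toy quantifications; an unweighted Pearson correlation between quantification vectors is assumed for ρ(q_{iL}, q_{iNL}).

```python
import numpy as np

rng = np.random.default_rng(5)

# --- Eq. (6): d equals sqrt(2(1 - rho)) for standardized scores ---
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)      # arbitrary correlated "scores"
xL = (a - a.mean()) / a.std()                 # zero mean, unit (population) variance
xNL = (b - b.mean()) / b.std()
d_direct = np.sqrt(np.mean((xL - xNL) ** 2))
rho_L_NL = np.mean(xL * xNL)                  # correlation of standardized scores
d_formula = np.sqrt(2 * (1 - rho_L_NL))

# --- Eq. (7): NLI = m minus the sum of per-variable correlations ---
def nli(q_lin, q_opt):
    return len(q_lin) - sum(np.corrcoef(ql, qo)[0, 1]
                            for ql, qo in zip(q_lin, q_opt))

labels = [np.array([1.0, 2.0, 3.0, 4.0])] * 2   # equidistant Likert labels, m = 2
bent = [np.array([1.0, 1.2, 1.5, 4.0])] * 2     # hypothetical optimal quantifications
nli_linear = nli(labels, labels)                # equidistant case: index is 0
nli_bent = nli(labels, bent)                    # nonlinear case: index is positive
```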
We considered a 4-dimensional ordinal variable, whose components have the first four distributions of Table 4, dealing separately with the two cases (uniform and asymmetrical). In a first step, we assumed the same value of the correlation coefficient ρ for all pairs of the four variables, fixing it equal to 0.2, 0.4 and 0.6, thus obtaining three correlation matrices R(0.2), R(0.4) and R(0.6) and six scenarios. Then, we moved to six further scenarios, using the same marginal distributions but changing the correlation matrix R(ρ) into R′(ρ), having the same maximum eigenvalue as R(ρ) but with unequal pairwise correlations inside. The choice is obviously not unique; we considered the matrix

\[
R'(\rho) = \begin{pmatrix} 1 & a & b & \rho \\ a & 1 & a & b \\ b & a & 1 & a \\ \rho & b & a & 1 \end{pmatrix}
\]

with a = 2b.
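For the 4×4 equicorrelation matrix R(ρ), λmax = 1 + 3ρ; a value of b (with a = 2b) giving R′(ρ) the same maximum eigenvalue can then be found numerically, for instance by bisection. A Python sketch of ours, under these assumptions:

```python
import numpy as np

def lam_max(rho, b):
    """Maximum eigenvalue of R'(rho) under the constraint a = 2b."""
    a = 2 * b
    R = np.array([[1, a, b, rho],
                  [a, 1, a, b],
                  [b, a, 1, a],
                  [rho, b, a, 1]], dtype=float)
    return np.linalg.eigvalsh(R)[-1]          # eigvalsh returns ascending order

rho = 0.4
target = 1 + 3 * rho                          # lambda_max of the equicorrelation R(rho)
lo, hi = 0.0, 0.5                             # lam_max is increasing in b on this range
for _ in range(60):                           # plain bisection on b
    mid = (lo + hi) / 2
    if lam_max(rho, mid) < target:
        lo = mid
    else:
        hi = mid
b = (lo + hi) / 2                             # for rho = 0.4, b is close to 0.25
```

One should also check that the resulting R′ is positive definite, i.e. a valid correlation matrix.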
For each of the 12 scenarios, using our procedure, S = 2,000 Monte Carlo samples of size n = 500 were drawn. For each scenario and each sample s we calculated the distance d(s), as defined in (6), and then derived its empirical distribution function F_d. The functions F_d, one for each of the twelve scenarios, are plotted in Figure 4. The figure points out that the difference in scores, and hence in results, between PCA and NLPCA is greater for smaller values of ρ and, to a lesser extent, for asymmetrical distributions. In fact, a small value of ρ means that the data do not present a strong linear structure, and then NLPCA may be more appropriate than PCA, because it can catch a potential nonlinear structure. With regard to the second finding, it may be noted that if the marginal distributions are asymmetrical the scores are more sensitive to the differences in the category values, so once again NLPCA seems more suitable. The variability of the ρij's within the matrix R′ (for the same λmax) does not seem to affect the results, confirming an analogous finding in [16], obtained following a different approach. From this simple simulation study, the factor that most influences the difference in scores between PCA and NLPCA is clearly the magnitude of ρ.
The values of the ratio λ^{NL}_{max}/λ^{L}_{max} and of NLI were also computed on the matrices D under each scenario and for each Monte Carlo run. In Figure 5 the boxplots of the ratio for the matrices R(0.2), R(0.4) and R(0.6) are displayed. As one can see, reducing the value of ρ and passing from uniform to asymmetrical distributions increases the ratio and makes NLPCA more suitable.
Finally, in Figure 6, the boxplots of NLI under the same 12 scenarios are provided. The index increases when reducing ρ and passing from uniform to asymmetrical distributions, thus confirming the previous findings (NLPCA is preferable to PCA when the linear correlation coefficients are small and the distributions are asymmetrical).
Figure 4: Empirical distribution functions of the distance d between PCA and NLPCA scores under the different scenarios: (a) uniform marginal distributions and correlation matrix R with constant ρ inside; (b) uniform marginal distributions and correlation matrix R′ with unequal ρ inside; (c) asymmetrical marginal distributions and correlation matrix R with constant ρ inside; (d) asymmetrical marginal distributions and correlation matrix R′ with unequal ρ inside; each panel shows R(0.2), R(0.4) and R(0.6).
Figure 5: Boxplots of the ratio λ^{NL}_{max}/λ^{L}_{max} under the different scenarios: (a)-(c) uniform distributions with ρ = 0.2, 0.4, 0.6; (d)-(f) asymmetrical distributions with ρ = 0.2, 0.4, 0.6.
Figure 6: Boxplots of NLI under the different scenarios: (a)-(c) uniform distributions with ρ = 0.2, 0.4, 0.6; (d)-(f) asymmetrical distributions with ρ = 0.2, 0.4, 0.6.
6. Conclusions
A method for generating ordinal data matrices under fixed experimental conditions, such as marginal distributions and correlation matrix, has been proposed. This method can be seen as a technique that transforms multivariate normal variables into other random variables with a partially assigned distribution, and is specifically addressed to ordinal variables. It starts from a multivariate standard normal variable Z with a known correlation matrix; from Z a random sample is drawn and the sample values are transformed into ordinal ones according to the desired marginal distributions. The correlation matrix of Z is properly computed by an original algorithm in order to ensure the required correlation matrix for the ordinal variables.
The proposal is very flexible, since it allows the researcher to choose the desired number of categories and the marginal distribution for each ordinal variable, as well as the association structure among the variables themselves, by entering the desired correlation matrix and checking its feasibility. The procedure can be used to simulate samples from ordinal variables with a pre-specified structure and is suitable, for example, for checking the behavior or performance of estimators, tests and statistical methods of analysis, or for comparing different methods under varying experimental conditions.
To show its applicability, two simulation studies have been carried out, one addressed to an inferential problem, the other related to an exploratory analysis. The first aims at deriving the sampling distribution of the correlation coefficient in the ordinal case and trying out the effectiveness of the procedure usually adopted under some scenarios; in the second, a comparison between the results of PCA and NLPCA for building a composite indicator is proposed and some suggestions about the choice between them are given. Both applications show how the method works and how easy its application is. Moreover, since the method is implemented in R as a very user-friendly tool, it can also be helpful for both research and teaching purposes.
References
[1] Biswas, A. (2004) Generating correlated ordinal categorical random samples, Statistics & Probability Letters
70(1): 25-35
[2] Bollen, K.A., Barb, B.H. (1981) Pearson’s R and Coarsely Categorized Measures, American Sociological
Review 46(2): 232-239
[3] Cario, M. C., Nelson, B.L. (1997) Modeling and generating random vectors with arbitrary marginal dis-
tributions and correlation matrix. Technical report, Department of Industrial Engineering and Management
Sciences, Northwestern University, Evanston, Illinois
[4] Carpita, M., Manisera, M. (2011) On the nonlinearity of homogeneous ordinal variables. In S. Ingrassia, R. Rocci, M. Vichi (Eds.), New Perspectives in Statistical Modeling and Data Analysis. Springer Series “Studies in Classification, Data Analysis, and Knowledge Organization”, ISBN: 978-3-642-11362-8
[5] Charmpis, D.C., Panteli, P.L. (2004) A heuristic approach for the generation of multivariate random samples
with specified marginal distributions and correlation matrix, Computational Statistics 19: 283-300
[6] Demirtas, H. (2009) A method for multivariate ordinal data generation given marginal distributions and
correlations, Journal of Statistical Computation and Simulation 76(11): 1017-1025
[7] Fisher, R.A. (1921) On the “probable error” of a coefficient of correlation deduced from a small sample,
Metron 1: 3-32
[8] Ferrari, P.A., Annoni, P., Barbiero, A., Manzi, G. (2011) An imputation method for categorical variables
with application to nonlinear principal component analysis, Computational Statistics & Data Analysis 55(6):
2410-2420
[9] Fréchet, M. (1951) Sur les tableaux de corrélation dont les marges sont données, Ann. Univ. Lyon, Section A, Series 3(14): 53-77
[10] Guilford, J.P. (1965) Fundamental Statistics in Psychology and Education. New York: McGraw-Hill
[11] Iman, R.L., Conover, W.J. (1982) A Distribution-Free Approach to Inducing Rank Correlation Among Input
Variables, Communications in Statistics: Simulation Computation 11(3): 311-334.
[12] Higham, N. (2002) Computing the nearest correlation matrix - a problem from finance, IMA Journal of
Numerical Analysis 22: 329-343
[13] Johnson, M.E. (1987) Multivariate Statistical Simulation. New York: John Wiley
[14] Li, S.T., Hammond, J.L. (1975) Generation of pseudorandom numbers with specified univariate distributions
and correlation coefficients, IEEE Transactions on Systems, Man and Cybernetics 5: 557-561
[15] Lurie, P.M., Goldberg, M.S. (1998) An Approximate Method for Sampling Correlated Random Variables
from Partially-Specified Distributions, Management Science 44(2): 203-218
[16] van Rijckevorsel, J.L.A., Bettonvil, B., de Leeuw, J., (1985). Recovery and Stability in Non-linear Principal
Components Analysis. University of Leiden: Department of Data Theory.
[17] Ruscio, J., Kaczetow, W. (2008) Simulating Multivariate Nonnormal Data Using an Iterative Algorithm,
Multivariate Behavioral Research, 43(3): 355-381
[18] Schweizer, B., Wolff, E. F. (1981) On nonparametric measures of dependence for random variables, Annals
of Statistics 9: 879-885
[19] Stanhope, S. (2005) Case studies in multivariate-to-anything transforms for partially specified random vector
generation, Insurance: Mathematics and Economics 1(1): 68-79