
ORIGINAL PAPER

Defining multiple criteria for meaningful outcome in routine outcome measurement using the Health of the Nation Outcome Scales

Alberto Parabiaghi • Hans E. Kortrijk • Cornelis L. Mulder

Received: 10 June 2013 / Accepted: 24 July 2013 / Published online: 6 August 2013

© Springer-Verlag Berlin Heidelberg 2013

Abstract

Purpose Using the reliable and clinically significant change approach, we aimed to identify meaningful outcome indicators for the Health of the Nation Outcome Scales (HoNOS) and to combine them in a single model. We applied these indicators to the 1-year outcome of two large samples of people attending community mental health services in Italy (cohort 1) and the Netherlands (cohort 2).

Methods Data were drawn from two studies on routine outcome assessment. The criteria for meaningful outcome were defined on both study cohorts and both language versions of the scale. The model combined (a) two criteria for adequate change (at least 4 or 8 points change), (b) two cut-offs for clinically significant change (a total score of 10 was the threshold between mild and moderate illness, 13 between moderate and severe illness), and (c) a method for classifying stable subjects in three degrees of severity (stable in mild, moderate or severe illness). Results were compared with those given by the effect size (ES) and analysis of variance and covariance (ANOVA and ANCOVA).

Results With the proposed approach the outcome of cohort 1 was better than that of cohort 2, with 65–67 % of its subjects showing a positive outcome compared to only 45–46 %. The other reference methods (ES and ANOVA), however, showed a greater improvement for cohort 2. ANCOVA indicated that the differences were due to regression to the mean (RTM), which showed opposite effects across the two cohorts.

Conclusions The proposed approach proved valuable and generalizable for interpreting outcome on HoNOS, and was scarcely influenced by the RTM effect. Its introduction could benefit outcome evaluation and management.

Keywords Meaningful outcome · Reliable change · Routine outcome measurement · Mental health · HoNOS · Regression to the mean

Introduction

The importance of routine outcome measurement (ROM) in mental health care has been recognized worldwide. It has been officially adopted in the UK, The Netherlands, Australia and New Zealand and is being implemented in many other countries [1–4]. A central feature of outcome assessment is its ability to measure clinically significant change over time [5–7]. The so-called pre-post design is the simplest and most straightforward way to highlight meaningful change. It requires at least two assessments, usually at the start and at the end of treatment. Thus, change is measured using the two assessments through the so-called two-wave difference-score (d-score) approach. Information on change can be delivered at the individual and the group level [8]. Since the two levels are closely linked, for most stakeholders information on both is of primary importance [8, 9]. Group data can be deductively used to form criteria for assessing individual clinically significant change, while individual data can be inductively used to define meaningful change at the group level. Group level results can inform interpretation at the individual level, and vice versa [10].

A. Parabiaghi (✉)
Laboratory of Epidemiology and Social Psychiatry, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Via la Masa 19, 20156 Milan, Italy
e-mail: [email protected]

H. E. Kortrijk
Bavo-Europoort, Mental Healthcare Organization, Prins Constantijnweg 48-54, 3066 TA Rotterdam, The Netherlands

C. L. Mulder
Department of Psychiatry, Epidemiological and Social Psychiatric Research Institute, Erasmus MC, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands

Soc Psychiatry Psychiatr Epidemiol (2014) 49:291–305
DOI 10.1007/s00127-013-0750-7

While information on change at the group level is useful for judging overall treatment effectiveness, there is overwhelming agreement that statistical significance alone is inappropriate for evaluating change in outcome research [8]. Alternative methods may provide more information on the degree of individual meaningful change within groups of patients. The so-called “classify and count” approaches can be adopted to quantify the proportion of patients who show a meaningful d-score change at follow-up. These approaches involve classifying severity based on the number of individual items with a certain severity score [6, 7]. Although preferable to statistical methods, which are based on the likelihood of change, these methods are not self-sufficient since they still need a statistical approach to identify significant differences between groups. Moreover, they depend on the availability of reliable, valid criteria to classify individuals as clinically improved or worsened.

The criteria for meaningful change can be based on anchor- or distribution-based approaches [5, 11, 12]. Anchor-based approaches require independent standards or ‘anchors’ and involve comparing groups that differ in terms of some disease-related external criterion in order to identify a cut-off for clinical significance. Three alternatives for determining clinically significant (CS) change have been proposed: (a) the patient’s score after clinical intervention is two standard deviations (SD) or more from the dysfunctional population mean in the functional direction; (b) the patient’s score after clinical intervention is within two SD of the functional population mean; (c) the patient’s score after clinical intervention is closer to the mean of the functional population than the mean of the dysfunctional population [13–20]. Only the third criterion for clinical significance, based on the greater likelihood of the patient being in the normative than in a clinical distribution, is anchor-based [21].

Distribution-based approaches are based on the statistical characteristics of the sample. Three broad types of distribution-based measures can be identified, based on (a) statistical significance; (b) sample variability; and (c) measurement precision [5, 11].

The four candidate approaches for interpreting individual meaningful change in mental health care are effect size (ES), standard error of measurement (SEM), reliable change (RC), and reliable and clinically significant change (RCSC) [6–8]. Although they are distinct statistics, Burgess et al. [22] showed that ES-medium and SEM gave the same thresholds for change.

Cohen’s ES is a popular and recommended distribution-based approach and relies on a measure of variability to identify group-level meaningful change [23, 24]. The techniques developed for using methods like ES to assess change for individuals do not take individual test error into account [6].

The SEM adopts a distribution-based criterion to identify minimal individual clinically significant change, but it is too directly related to the reliability of the test [25]. The RC approach is closely related to the SEM as they both explicitly incorporate the reliability of the measure. However, the RC is more conservative because it also considers whether the change is statistically reliable. Internal consistency, expressed as Cronbach’s coefficient alpha (α), has mostly been adopted as the reliability parameter for calculating SEM and RC.

Jacobson and Truax [15] proposed combining RC with clinically significant change (CSC) in order to identify individual meaningful change [13, 14, 16–20]. This approach comprises two individual-level methods for identifying meaningful change: a distribution-based approach (i.e. RC) and an anchor-based approach (i.e. CSC). The Jacobson and Truax (JT) method [13, 15] was considered one of the most promising for establishing clinically meaningful change and has been widely used to identify individual clinical improvement with different outcome scales and in various settings [26–32].

Many authors have criticized the original RCSC concept, focusing on regression toward the mean, which was not taken into consideration, and on the fact that functional/dysfunctional populations are likely to differ in many other relevant variables besides the outcome considered [16, 18–20, 33]. The discussion remarked on the importance of baseline severity in determining clinically meaningful change [12]. Several modifications of the formula for calculating the RC index [16, 18–20] have been suggested to correct the estimates for error-based regression to the mean, either by adjusting the pre-test score for regression to the mean [34] or by correcting change scores according to the reliability of the differences [18–20].

Despite all these criticisms the original JT approach is sound, practical and easy to understand and can therefore be considered the most appropriate method in the domain of outcome assessment as a means for communicating with patients, families and service providers [9]. However, its low sensitivity to change still poses major limits to its use in group-level outcome evaluation.

In previous reports on the application of RCSC in ROM, the majority (70–90 %) of patients appeared stable across time and this was attributed to the conservativeness of the approach [30–32]. Too conservative an approach might fail to adequately inform stakeholders on clinically meaningful change as only a scant minority of patients would be classified as “improved” or “worsened” and the outcome of the “unchanged” majority would be hard to interpret. One solution is to adopt a combined model comprising multiple criteria for change. Matthey et al. [29] proposed using the RC index to detect improvement or deterioration and both the RC index and CS change to establish recovery. Parabiaghi et al. [30] proposed adopting RC to indicate improvement or deterioration and RC plus two different CS cut-offs to establish clinical improvement or deterioration (cut-off 1) and remission or recurrence (cut-off 2). Both approaches took into account different degrees of change in the direction of either improvement or worsening with the result of conveying much more information about the clinical relevance of group-level outcome [29, 30]. Information on outcome should focus on meaningful change. However, as ‘no change’ may be regarded as either a good outcome or a disappointing one, the characteristics of clinical stability should also be taken into account in interpreting outcome in everyday mental health practice [8].

We aimed at identifying and validating a generalizable model for assessing clinically meaningful outcome (MO) in ROM which adopted both a distribution-based approach (i.e. SEM or RC) and an anchor-based approach (CS cut-offs), and which took account of different degrees of individual change and of different levels of severity within illness stability. The aims were (a) to calculate common multiple criteria for meaningful outcome on the Italian and the Dutch versions of the Health of the Nation Outcome Scales (HoNOS); (b) to integrate those criteria in a single model, creating a combined method for interpreting meaningful outcome; (c) to apply the model to two large cohorts of people attending community mental health services in Italy and the Netherlands; (d) to calculate the measure of agreement between the results obtained with the proposed approach and those given by the calculation of the individual ES; (e) to compare these outcome results with those from other reference methods (ES, ANOVA and ANCOVA).

Methods

Study samples

Italian data were drawn from three observational outcome studies on ROM in a group of Italian community mental health services (CMHSs) between 2003 and 2011 [30, 32]. Each of these studies selected a prevalence cohort of people attending those services, followed up for 1 year in a naturalistic fashion. We used baseline HoNOS data (HoNOS total score, standard deviation and Cronbach’s alpha) from patients in these studies to identify multiple criteria for meaningful change. For empirical application of the model, data were drawn only from the third Italian ROM study. We used baseline and 1-year follow-up HoNOS ratings of all complete cases (3,526; cohort 1).

Dutch data were obtained from seven Assertive Community Treatment (ACT) teams in the city of Rotterdam, the Netherlands, as part of a ROM procedure [35, 36]. ROM assessments were planned every 6–12 months. The ratings were completed by independent raters (mostly psychologists), and were used in clinical practice to discuss treatment progress with the patient. Criteria for treatment by an ACT team were (1) age 18 or older, (2) having a severe mental illness, usually a psychotic disorder, with or without a co-morbid substance use disorder (SUD), and (3) lack of motivation for regular treatment at the start of ACT that made assertive outreach necessary. The first contact with mental health services dated back about a decade before entry into ACT [35, 36]. For the empirical application of the model, we used baseline and 1-year follow-up data of all complete cases (805; cohort 2).

Outcome data were collected as part of routine clinical care and for purposes of quality improvement; thus, no specific written informed consent was obtained.

Instrument

The Health of the Nation Outcome Scales (HoNOS) was developed as a standardized assessment tool for routine use to evaluate treatment progress in mental health services [37]. It consists of 12 clinician-rated scales, each using five points ranging from 0 (no problem) to 4 (severe/very severe), yielding a total score from 0 to 48. The HoNOS covers the following domains: (1) overactive, aggressive, disruptive or agitated behaviour, (2) non-accidental self-harm, (3) problem drinking and drug-taking, (4) cognitive problems, (5) physical illness and disability, (6) hallucinations and delusions, (7) depressed mood, (8) other psychological symptoms, (9) relationship problems, (10) problems with activities of daily living, (11) problems with living conditions, and (12) problems with occupation and activities. Independent studies have evaluated its reliability, sub-scale structure, sensitivity to change and appropriateness for routine clinical use in busy psychiatric services; see Pirkis et al. [38] for a review. The psychometric properties of the English, Italian and Dutch HoNOS versions have been found to be acceptable [37, 39–41]. The Italian and Dutch versions of HoNOS were adopted for cohorts 1 and 2 respectively [39–41].

Statistical methods

This study comprised two groups of patients (Italian and Dutch). We started by calculating descriptive statistics and examined normality plots of the data to ascertain that the HoNOS ratings were normally distributed (normal Q–Q plot). Then we compared the two groups using Pearson’s Chi-square tests (for categorical data) and independent sample t tests (for non-categorical data).

The SEM, reliable change (RC index) and two cut-offs for clinical significance were calculated to identify multiple criteria for meaningful outcome for the HoNOS. We followed a procedure similar to that described by Parabiaghi et al. [30] and acknowledged by the Australian Mental Health Outcomes & Classification Network [6, 22].

As both indicators of meaningful change take account of the reliability and variability of measures (i.e. Cronbach’s alpha and the standard deviation), we first needed to find consensus and select parameters that could be applied (a) to the two different language versions of HoNOS and (b) to the two different cohorts of service users. For this purpose we generated a series of calculation tables displaying the SEM/RC indexes obtained for a wide range of reliability and dispersion characteristics (see “Appendix” for calculation formulas) which enabled us to select common multiple criteria for meaningful change.

For the RC index alpha was set at 0.10 (two-tailed). This means that change scores above the upper threshold were considered as reliable improvement, and scores below the lower threshold were considered as reliable deterioration (with 90 % confidence). Because individual total scores can only change in points (integers), the calculated thresholds were rounded to the next integer.
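The threshold calculation can be sketched numerically. The snippet below is only an illustration: it assumes the standard formulas SEM = SD × √(1 − α) and RC = z × √2 × SEM (z ≈ 1.645 for a two-tailed alpha of 0.10) and reads "rounded to the next integer" as rounding up. The paper's own formulas are in its Appendix, which is not reproduced here, so the function names and formulas are assumptions. The inputs are the merged-sample values reported in the Results (Cronbach's alpha 0.75, SD 6.49).

```python
import math
from statistics import NormalDist


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement (assumed formula: SD * sqrt(1 - alpha))."""
    return sd * math.sqrt(1.0 - reliability)


def reliable_change(sd: float, reliability: float, alpha: float = 0.10) -> float:
    """Reliable change threshold (assumed Jacobson-Truax form: z * sqrt(2) * SEM)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.645 for alpha = 0.10
    return z * math.sqrt(2.0) * sem(sd, reliability)


def next_integer(threshold: float) -> int:
    """Round a threshold up to the next whole HoNOS point (assumed reading)."""
    return math.ceil(threshold)


# Merged-sample values reported in the Results: alpha = 0.75, SD = 6.49
sem_points = next_integer(sem(6.49, 0.75))
rc_points = next_integer(reliable_change(6.49, 0.75))
print(sem_points, rc_points)
```

Under these assumptions the merged-sample values give the four-point and eight-point thresholds used in the model.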

The measure of agreement among RC, SEM and individual ES (ES-ind) was evaluated on the aggregated sample (i.e. both sites) using a 5 × 5 agreement table and Cohen’s kappa.
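The two five-level classifications that are cross-tabulated can be illustrated as follows. This is a hedged sketch, not the authors' code: the function names are hypothetical, the change score is taken as baseline minus follow-up (positive values mean improvement), and the point thresholds (4 and 8 for SEM/RC; 2 and 6 for small and large ES) are the values reported in the Results. Treating a change of 6 points or more as "great" is an assumption that is consistent with the zero cells of Table 3.

```python
def sem_rc_category(change: int, sem_points: int = 4, rc_points: int = 8) -> str:
    """Classify a change score (baseline minus follow-up) by the SEM/RC criteria."""
    if change >= rc_points:
        return "improved (RC90%)"
    if change >= sem_points:
        return "improved (SEM)"
    if change <= -rc_points:
        return "worsened (RC90%)"
    if change <= -sem_points:
        return "worsened (SEM)"
    return "stable"


def es_ind_category(change: int, small: int = 2, large: int = 6) -> str:
    """Classify the same change score by individual effect-size thresholds."""
    magnitude = abs(change)
    if magnitude < small:
        return "stable"
    direction = "improved" if change > 0 else "worsened"
    size = "great" if magnitude >= large else "small to moderate"
    return f"{direction} ({size} change)"


# A 5-point improvement is a minimal (SEM) change but only a
# small-to-moderate individual effect size.
print(sem_rc_category(5), "|", es_ind_category(5))
```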

A classify-and-count approach was applied to the baseline assessments of cohorts 1 and 2 (see [9] and [32] for details) for calculating the cut-offs for clinical significance. We stratified our subjects into four categories of severity: subclinical sample, clinical sample (mild), moderately severe sample, very severe sample. As proposed by Lelliott [42], a score of > 2 in at least one item was adopted to distinguish between severe and non-severe patients. Severity was further distinguished by dividing the group of severe subjects into those having more than one item’s score of > 2 (‘very severe’) and those with only one item’s score > 2 (‘moderately severe’). The non-severe patients were further divided as ‘subclinical’ (score < 2 in all items) and ‘clinical/mild’ (at least one item’s score of 2). Then, we calculated the cut-off points where the chance of either distribution was the same (see “Appendix”). Cut-offs were calculated between “very severe” and the rest of the sample (cut-off 1) and between “subclinical” and the rest of the sample (cut-off 2). In addition to these classifications we proposed to define ‘remission’ as: no more than minimal to mild problems in psychosocial functioning. Patients classified as mildly ill achieved remission if all HoNOS items were scored ≤ 2 over a period of 6 months or more.
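The four-way severity stratification described above can be written as a short rule on the 12 item scores. The following is a minimal sketch; the function name and the input validation are illustrative assumptions, while the rules themselves follow the text.

```python
from typing import Sequence


def severity_group(items: Sequence[int]) -> str:
    """Stratify one HoNOS assessment (12 items, each scored 0-4) by severity."""
    if len(items) != 12 or any(not 0 <= score <= 4 for score in items):
        raise ValueError("expected 12 item scores between 0 and 4")
    severe_items = sum(1 for score in items if score > 2)
    if severe_items > 1:
        return "very severe"
    if severe_items == 1:
        return "moderately severe"
    if any(score == 2 for score in items):
        return "clinical/mild"
    return "subclinical"  # every item scored below 2


example = [1, 0, 0, 2, 1, 3, 1, 0, 2, 1, 0, 1]
print(sum(example), severity_group(example))
```

The total score is simply the sum of the 12 items (range 0 to 48); the example assessment has one item above 2 and is therefore "moderately severe".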

Empirical evaluation of the proposed model involved applying the calculated criteria to the HoNOS assessments of both study cohorts. Next, study outcomes of both cohorts were calculated with classical statistical tests (two-tailed paired t test, mixed-group factorial ANOVA) and individual and group ES. To allow comparisons of the ‘clinical model’ with the other conventional methods we created an ordinal variable ranging from −3 (worsened in/to severe illness) to +4 (stable in mild illness) based on the RCSC index. The measure of agreement between the RCSC index and individual ES was evaluated based on the Spearman’s rho.

Subsequently, we examined the degree to which conclusions about outcomes generated by the different methods were comparable. For proper interpretation of the results of this descriptive comparison (RCSC vs classical approaches), we estimated the effect of the regression to the mean (RTM) on outcome, based on the hypothetical assumption that the subjects in the two cohorts were drawn from the same population. Thus, the mean of the aggregate sample (i.e. both sites) was used as an estimate of the population ‘true mean’.

All the subjects were stratified on the baseline score: (1) lower than average (scores < 1 SD); (2) average; (3) higher than average (scores > 1 SD). Change-scores within each stratum were calculated to analyse whether the differences between strata could be attributed to an RTM effect. The effect was then estimated as described by Barnett et al. [43] (see “Appendix” for formula) and the observed change was corrected by subtraction (or by addition) of the estimated RTM effects [44, 45].

RTM showed a substantial opposite effect across the two cohorts. Therefore, we corrected the results of the conventional statistical approaches by applying two ANCOVA models. First, as described by Barnett et al. [43], we used the following single regression equation to adjust each subject’s outcome according to their baseline measurement: follow-up = constant + a × (baseline − baseline mean) + b × group + error. In this equation, ‘baseline − baseline mean’ represents the baseline assessment centred on the ‘true mean’. Second, as the reliability of HoNOS assessments (Cronbach’s alpha) differed across the two sites (0.74 vs 0.56), a non-equivalent group analysis approach was used to obtain reliability-corrected results (see http://www.socialresearchmethods.net/kb/statnegd.php for further explanation): follow-up = constant + a × (reliability-corrected baseline − baseline mean) + b × group + error [46].
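The first ANCOVA equation is an ordinary least-squares regression of follow-up on the centred baseline and a group indicator. The sketch below fits it with the normal equations using only the standard library; the helper names and the toy data are hypothetical (the data are generated without noise from known coefficients, so the fit should recover them), and the baseline is centred on the mean of the pooled sample, as the text describes.

```python
from typing import List, Sequence, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ancova(baseline: Sequence[float], follow_up: Sequence[float],
               group: Sequence[int]) -> Tuple[float, float, float]:
    """Fit follow-up = constant + a * (baseline - pooled mean) + b * group."""
    pooled_mean = sum(baseline) / len(baseline)
    rows = [[1.0, x - pooled_mean, float(g)] for x, g in zip(baseline, group)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, follow_up)) for i in range(3)]
    constant, a, b = solve_linear(xtx, xty)
    return constant, a, b


# Hypothetical toy data: two groups of four subjects, pooled baseline mean 13
baseline = [4, 8, 12, 16, 10, 14, 18, 22]
group = [0, 0, 0, 0, 1, 1, 1, 1]
follow_up = [9.0 + 0.6 * (x - 13.0) - 1.5 * g for x, g in zip(baseline, group)]
constant, a, b = fit_ancova(baseline, follow_up, group)
print(round(constant, 3), round(a, 3), round(b, 3))
```

Here b is the baseline-adjusted difference between the groups at follow-up. For the second model, the baseline column would be replaced by a reliability-corrected baseline, whose exact correction is described in the cited source rather than in this section.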



Results

The study cohorts

The Italian sample (3,526) was much larger than the Dutch sample (805) and therefore dominant in the aggregated analyses (i.e. both sites) (Table 1). The two cohorts showed some significant differences. The Italian patients (cohort 1) were older (t = 15.276, df = 1,477.945, p < 0.001), less often men (χ² = 228.741, df = 1, p < 0.001; OR 0.28, 95 % CI 0.236–0.333) and had lower HoNOS total scores at baseline (t = −28.532, df = 1,317.998, p < 0.001) and follow-up (t = −23.904, df = 1,347.818, p < 0.001). There were significant differences in primary diagnosis: Dutch ACT patients (cohort 2) were more frequently diagnosed with a psychotic disorder (χ² = 25.215, df = 1, p < 0.001) or a substance-related disorder (χ² = 149.853, df = 1, p < 0.001), and were more often assigned to the ‘deferred’ or ‘other diagnosis’ categories (χ² = 65.793, df = 1, p < 0.001). Italian patients were more often diagnosed with an affective disorder (χ² = 214.65, df = 1, p < 0.001). Follow-up was somewhat longer in the Italian sample (t = 7.914, df = 832.877, p < 0.001).

Criteria for meaningful change

We first drew data from the baseline assessments of the three consecutive Italian outcome studies and the Dutch ROM study [30, 32, 35, 36]. The Cronbach’s alphas (α) and standard deviations (SDs) of the HoNOS total scores were respectively: α = 0.72 and SD = 6.2 (n = 1,413; HoNOS-3); α = 0.75 and SD = 6.4 (n = 844; HoNOS-4); α = 0.74 and SD = 6.5 (n = 3,556; HoNOS-5); α = 0.56 and SD = 5.3 (Dutch ROM data). Table 2 reproduces an extract of the SEM calculation tables. For a given level of alpha and a SD value falling in the indicated interval, the table indicates the minimum and maximum SEM values and the corresponding ESs. Cronbach’s alpha values and the SD intervals of the four samples were consistent with a four-point change threshold (i.e. SEM-max = 4). A change of at least four points in HoNOS total score from baseline to follow-up would correspond to an ES from 0.50 to 0.89 (Table 2). Thus, a minimally detectable change of at least four points (SEM) and, consequently, a reliable change of at least eight points (RCI 90 %) were suitable for all samples and thus appropriate for the aims of the present study.
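The SD intervals and ES ranges of Table 2 can be regenerated from the SEM bounds. The sketch below assumes SEM = SD × √(1 − α) (so the SD bound for a given SEM is SEM / √(1 − α)) and takes the ES of a four-point change as 4 / SD; both formulas and the function name are assumptions, since the Appendix is not reproduced here, but under them the tabulated values are reproduced to two decimals.

```python
import math
from typing import Tuple


def sd_interval(alpha: float, sem_min: float = 3.0,
                sem_max: float = 4.0) -> Tuple[float, float, float, float]:
    """Return (SD-min, SD-max, ES-low, ES-high) for a given Cronbach's alpha."""
    root = math.sqrt(1.0 - alpha)
    sd_min = sem_min / root      # SD at which the SEM equals sem_min
    sd_max = sem_max / root      # SD at which the SEM equals sem_max
    es_low = sem_max / sd_max    # ES of a 4-point change at the largest SD
    es_high = sem_max / sd_min   # ES of a 4-point change at the smallest SD
    return (round(sd_min, 2), round(sd_max, 2),
            round(es_low, 2), round(es_high, 2))


for alpha in (0.55, 0.60, 0.65, 0.70, 0.75, 0.80):
    print(alpha, sd_interval(alpha))
```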

The cut-offs for clinical significance were calculated on the baseline assessments of cohorts 1 and 2. The samples were considered both separately and merged. The calculations gave comparable results. As a result, the threshold between moderate and severe illness was 13 (cut-off 1) and the threshold between moderate and mild illness was 10 (cut-off 2).

The RCSC criteria for meaningful change were applied to the HoNOS ratings of the two cohorts, considered separately and merged. In the merged sample, Cronbach’s alpha for the baseline HoNOS ratings was 0.75 and the mean total score was 11.61 (SD 6.49). These characteristics too were consistent with a SEM of four points and 90 % RC of 8 points (see Table 2). The change thresholds in the HoNOS total scores for small, medium and large ES were respectively 2, 4 and 6 [23].
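The ES point thresholds follow from multiplying the baseline SD by the ES benchmarks. The sketch below assumes Cohen's conventional values of 0.2, 0.5 and 0.8 for small, medium and large effects and rounds up to whole HoNOS points; the function name is illustrative.

```python
import math
from typing import Dict

# Cohen's conventional effect-size benchmarks (assumed here)
COHEN_BENCHMARKS = {"small": 0.2, "medium": 0.5, "large": 0.8}


def es_point_thresholds(sd: float) -> Dict[str, int]:
    """Convert ES benchmarks into whole-point change thresholds for a given SD."""
    return {label: math.ceil(es * sd) for label, es in COHEN_BENCHMARKS.items()}


# Merged-sample baseline SD reported above
print(es_point_thresholds(6.49))
```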

Agreement between SEM/RC and ES-ind

The agreement between the SEM/RC and ES-ind was evaluated using a 5 × 5 agreement table (see Table 3) and Cohen’s kappa, which resulted in a moderate degree of agreement (κ = 0.492; p < 0.0001). Cross-tabulation of the SEM/RC and ES-ind showed that for 2,646 subjects (61 %) the two approaches agreed in identifying stability or meaningful improvement and deterioration. The ES-ind criteria were less conservative than SEM/RC criteria. Among the 2,416 patients who did not meet criteria for change on the SEM, those who were misclassified met the criteria for small to moderate change on the ES-ind (741 improved and 457 worsened). In contrast, among the patients who met the criteria for change on the SEM only (821 improved and 381 worsened), those who were misclassified met the criteria for great change on the ES-ind (337 and 150, respectively). Among the 2,416 subjects who were stable for the SEM/RC classification, 1,198 (50 %) showed a small to moderate change for ES-ind either in the direction of worsening (457; 38 %) or improvement (741; 62 %). Patients meeting the criteria for great improvement according to the ES-ind also showed a reliable change (RC90 %) in only 59.9 % of the cases.

Table 1 Patients’ main characteristics (4,331 cases)

                                     Cohort 1 (a)        Cohort 2 (b)        Both sites (c)
Age, years (SD)                      47.1 (13.5)         40.2 (11.2)         45.7 (13.3)
Males (%)                            46.2                75.4                51.6
Primary diagnosis (%)
  Psychotic disorders                56.1                65.7                57.9
  Substance abuse/dependence         0.3                 7.5                 1.4
  Affective disorders                32.9                5.8                 28.2
  Deferred and other                 10.7                21.0                12.6
Follow-up, years (SD)                1 (0.1)             0.9 (0.6)           1 (0.3)
HoNOS baseline, mean (SD)            10.5 (6.2)          16.7 (5.4)          11.6 (6.5)
HoNOS follow-up, mean (SD)           9.2 (6.1)           14.2 (5.2)          10.1 (6.3)
Duration of illness, median (IQR)    11 (5–18) (d)       8.2 (2.4–15.5) (e)  11 (4–18)

(a) Italian ROM study (3,526); (b) Dutch ROM study (805); (c) merged sample; (d) n = 583 missing; (e) n = 344 missing

Table 2 Standard error of measurement (SEM) calculation table for a given alpha and standard deviation (SD)

Alpha (a)     SD-min (b)    SD-max (b)    SEM-min (c)   SEM-max (c)   ES interval (d)
0.55          4.47          5.96          3             4             0.67–0.89
0.60          4.74          6.32          3             4             0.63–0.84
0.65          5.07          6.76          3             4             0.59–0.79
0.70          5.48          7.30          3             4             0.55–0.73
0.75          6.00          8.00          3             4             0.50–0.67
0.80          6.71          8.94          3             4             0.45–0.60

(a) Cronbach’s alpha; (b) standard deviation; (c) standard error of measurement; (d) effect size
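The agreement figures for the SEM/RC and ES-ind classifications can be checked against the counts in Table 3. The sketch below computes the observed agreement and an unweighted Cohen's kappa from those counts (rows: ES-ind, columns: SEM/RC). The function names are illustrative, and the exact kappa computation used in the paper is not described, so the value obtained here may differ slightly in the later decimals from the reported one.

```python
from typing import List

# Counts from Table 3 (rows: ES-ind classes, columns: SEM/RC classes),
# both ordered from "worsened most" to "improved most".
TABLE_3 = [
    [209, 150, 0, 0, 0],
    [0, 231, 457, 0, 0],
    [0, 0, 1218, 0, 0],
    [0, 0, 741, 484, 0],
    [0, 0, 0, 337, 504],
]


def observed_agreement(table: List[List[int]]) -> float:
    """Share of subjects placed on the diagonal by both classifications."""
    total = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return agree / total


def cohen_kappa(table: List[List[int]]) -> float:
    """Unweighted Cohen's kappa for a square agreement table."""
    size = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    p_observed = observed_agreement(table)
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)


print(round(100 * observed_agreement(TABLE_3)), round(cohen_kappa(TABLE_3), 2))
```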

Application of the combined model for the interpretation of meaningful outcome

The application of the combined criteria (RC and SEM with CS cut-offs) on both cohorts’ follow-up resulted in the formation of 23 outcome groups (Table 4). Figure 1 illustrates the 1-year change of patients from both sites.

As discussed above, one of the main problems in clinical use of the criteria for RCSC is dealing with a high proportion of “unchanged” subjects [30–32]. The solution we chose was to divide the group of those who were stable for the SEM and the RC criteria into three subgroups based on the mean between the baseline (T0) and the follow-up (T1) score. Therefore, we added two parallel diagonal lines orthogonal to the SEM and RC diagonal bands for ‘no change’ to the original graph by JT. These diagonals divide the bands of ‘no change’ into three parts: mild illness, moderate illness, and severe illness (see also Figs. 1, 2) [15]. Thus, the “unchanged” subjects were divided into three subgroups: those falling to the left of the lower diagonal line had a mean score ((T0 + T1)/2) lower than 10 and were classified as ‘mildly ill’; those falling to the right of the upper diagonal line had a mean score higher than 13 and were classified as ‘severely ill’; those falling between the two diagonal lines had a mean score between 10 and 13 and were classified as ‘moderately ill’. This approach takes account of baseline and follow-up HoNOS ratings and is applied only to those without a minimal or reliable change at follow-up.
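The rule for subdividing the stable subjects can be expressed directly. In this minimal sketch the function name is hypothetical; the cut-offs (10 and 13) and the four-point SEM threshold are those given in the text, and mean scores of exactly 10 or 13 are placed in the moderate band, following the "between 10 and 13" wording.

```python
def stable_severity(t0: int, t1: int, mild_cutoff: float = 10,
                    severe_cutoff: float = 13, sem_points: int = 4) -> str:
    """Classify a subject with no meaningful change by the mean of T0 and T1."""
    if abs(t0 - t1) >= sem_points:
        raise ValueError("subject shows a minimal or reliable change")
    mean_score = (t0 + t1) / 2
    if mean_score < mild_cutoff:
        return "stable in mild illness"
    if mean_score > severe_cutoff:
        return "stable in severe illness"
    return "stable in moderate illness"


for scores in [(5, 6), (12, 11), (15, 14)]:
    print(scores, stable_severity(*scores))
```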

The ‘remitted’ patients (see “Methods”) were a subgroup of those who were “stable within mild illness” (Table 4).

The combined criteria for meaningful outcome (MO model) resulted in the formation of 23 outcome groups (Table 4; Fig. 1). This produced a very comprehensive but also fairly complex model which we propose for research (‘research model’). For the purposes of the present study (i.e. group-level outcome interpretation) we developed a simplified model (eight outcome groups) which used only the SEM criterion and the two cut-offs of clinical significance (Table 5; Fig. 2). In this ‘clinical model’ the category of ‘mild illness’ comprised both those who were unchanged in mild illness and those who were mildly ill at baseline with no change at follow-up. Those who showed a meaningful improvement were entered in the following three categories: ‘improvement to mild illness’, ‘improvement to moderate illness’, and ‘improvement within severe illness’ (respectively for mild, moderate or severe illness at follow-up). Those who showed a meaningful worsening at follow-up were divided into two categories: ‘worsening to moderate illness’ and ‘worsening in or to severe illness’.

Empirical evaluation of the MO model

For the ‘research model’ cohort 1 had a better outcome than

cohort 2 (Table 4). 64.7 % of the Italian patients had a

positive outcome compared to only 45.8 % of the Dutch

Table 3 Proportions of meaningfully improved or worsened patients according to SEM/RC and individual ES (ES-ind)

                                          SEM/RC classification
ES-ind classification                     Worsened (RC90 %)   Worsened (SEM)      Stable (no SEM)     Improved (SEM)      Improved (RC90 %)   Total
Worsened (ΔES great change)               209 (58.2 %)        150 (41.8 %)        0 (0.0 %)           0 (0.0 %)           0 (0.0 %)           359 (100.0 %)
Worsened (ΔES small to moderate change)   0 (0.0 %)           231 (33.6 %)        457 (66.4 %)        0 (0.0 %)           0 (0.0 %)           688 (100.0 %)
Stable (ΔES no change)                    0 (0.0 %)           0 (0.0 %)           1,218 (100.0 %)     0 (0.0 %)           0 (0.0 %)           1,218 (100.0 %)
Improved (ΔES small to moderate change)   0 (0.0 %)           0 (0.0 %)           741 (60.5 %)        484 (39.5 %)        0 (0.0 %)           1,225 (100.0 %)
Improved (ΔES great change)               0 (0.0 %)           0 (0.0 %)           0 (0.0 %)           337 (40.1 %)        504 (59.9 %)        841 (100.0 %)
Total                                     209 (4.8 %)         381 (8.8 %)         2,416 (55.8 %)      821 (19.0 %)        504 (11.6 %)        4,331 (100.0 %)

Values in italic indicate correct cross-classifications; values in bold indicate incorrect cross-classifications (4,331; counts and row percentages)



Table 4 Combined model for the interpretation of meaningful outcome (MO): extensive classification for research purposes

| MO classification (‘research model’) | Cohort 1 (a): subjects (%) | Cohort 1: HoNOS baseline, mean (SD) | Cohort 1: HoNOS follow-up, mean (SD) | Cohort 1: change score, mean (SD) | Cohort 2 (b): subjects (%) | Cohort 2: HoNOS baseline, mean (SD) | Cohort 2: HoNOS follow-up, mean (SD) | Cohort 2: change score, mean (SD) | Both sites (c): subjects (%) | Both sites: HoNOS baseline, mean (SD) | Both sites: HoNOS follow-up, mean (SD) | Both sites: change score, mean (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [1] STABLE in mild illness | 36.0 | 5.6 (2.8) | 5.2 (2.8) | 0.4 (1.7) | 7.2 | 8.9 (1.9) | 8.4 (2.0) | 0.5 (1.8) | 30.6 | 5.7 (2.8) | 5.3 (2.9) | 0.4 (1.7) |
| In remission, % (d) | 73 | – | – | – | 38 | – | – | – | 71 | | | |
| [2] Reliably improved in mild illness | 0.6 | 9.6 (0.7) | 0.7 (0.8) | 8.8 (0.9) | 0.1 | 10.0 (0.0) | 2.0 (0.0) | 8.0 (0.0) | 0.5 | 9.6 (0.7) | 0.7 (0.8) | 8.8 (0.9) |
| [3] Minimally improved in mild illness | 7.4 | 7.6 (1.7) | 2.6 (1.7) | 5.0 (1.0) | 0.7 | 8.2 (1.5) | 3.7 (1.8) | 4.5 (0.8) | 6.2 | 7.6 (1.7) | 2.6 (1.7) | 5.0 (1.0) |
| [4] Reliably improved from severe to mild illness | 5.8 | 18.0 (3.5) | 5.6 (2.6) | 12.4 (4.0) | 8.0 | 19.2 (3.6) | 7.2 (2.5) | 12.0 (3.4) | 6.2 | 18.3 (3.6) | 6.0 (2.7) | 12.3 (3.8) |
| [5] Minimally improved from severe to mild illness | 2.8 | 14.9 (0.9) | 9.2 (1.0) | 5.8 (1.1) | 4.5 | 15.0 (1.1) | 9.2 (1.0) | 5.8 (1.2) | 3.1 | 14.9 (1.0) | 9.2 (1.0) | 5.8 (1.1) |
| [6] Reliably improved from moderate to mild illness | 2.0 | 12.0 (0.8) | 2.8 (1.5) | 9.2 (1.3) | 0.2 | 13.0 (0.0) | 5.0 (0.0) | 8.0 (0.0) | 1.7 | 12.0 (0.8) | 2.9 (1.5) | 9.2 (1.3) |
| [7] Minimally improved from moderate to mild illness | 3.3 | 11.9 (0.8) | 6.6 (1.4) | 5.4 (1.2) | 2.2 | 12.4 (0.7) | 7.3 (1.2) | 5.1 (1.0) | 3.1 | 12.0 (0.8) | 6.7 (1.4) | 5.4 (1.2) |
| [8] Reliably improved from severe to moderate illness | 1.1 | 22.9 (3.2) | 12.0 (0.8) | 10.9 (2.8) | 4.8 | 22.5 (2.7) | 12.0 (0.8) | 10.5 (2.5) | 1.8 | 22.7 (2.9) | 12.0 (0.8) | 10.7 (2.6) |
| [9] Minimally improved from severe to moderate illness | 2.4 | 17.3 (1.4) | 11.9 (0.8) | 5.4 (1.1) | 4.8 | 17.5 (1.5) | 12.0 (0.8) | 5.5 (1.1) | 2.8 | 17.4 (1.4) | 11.9 (0.8) | 5.4 (1.0) |
| [10] Reliably improved in severe illness | 0.9 | 26.4 (3.9) | 16.4 (2.7) | 9.9 (2.5) | 3.7 | 26.4 (4.4) | 15.7 (2.5) | 10.7 (3.1) | 1.4 | 26.4 (4.1) | 16.0 (2.6) | 10.3 (2.8) |
| [11] Minimally improved in severe illness | 2.4 | 21.8 (2.8) | 16.6 (2.4) | 5.2 (1.2) | 9.4 | 22.5 (3.1) | 17.1 (3.0) | 5.3 (1.1) | 3.7 | 22.1 (3.0) | 16.8 (2.7) | 5.2 (1.1) |
| [12] STABLE in moderate illness | 8.8 | 11.7 (1.2) | 11.6 (1.2) | 0.2 (1.8) | 11.2 | 11.7 (1.3) | 11.8 (1.3) | -0.2 (2.0) | 9.3 | 11.7 (1.3) | 11.6 (1.3) | 0.1 (1.9) |
| [13] Minimally worsened in mild illness | 2.6 | 3.0 (1.8) | 7.8 (1.8) | -4.8 (1.0) | 0.4 | 6.0 (0.0) | 10.0 (0.0) | -4.0 (0.0) | 2.2 | 3.1 (1.8) | 7.9 (1.8) | -4.8 (1.0) |
| [14] Reliably worsened in mild illness | 0.1 | 0.8 (1.0) | 9.3 (1.0) | -8.5 (1.0) | 0.0 | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.1 | 0.8 (1.0) | 9.3 (1.0) | -8.5 (1.0) |
| [15] Minimally worsened from mild to moderate illness | 2.3 | 6.9 (1.4) | 12.0 (0.7) | -5.2 (1.0) | 0.7 | 7.8 (1.2) | 12.8 (0.4) | -5.0 (1.1) | 2.0 | 6.9 (1.4) | 12.1 (0.8) | -5.2 (1.1) |
| [16] Reliably worsened from mild to moderate illness | 0.8 | 3.4 (1.0) | 12.1 (0.8) | -8.8 (0.8) | 0.0 | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.6 | 3.4 (1.0) | 12.1 (0.8) | -8.8 (0.8) |
| [17] Minimally worsened from moderate to severe illness | 1.6 | 11.9 (0.9) | 17.2 (1.4) | -5.2 (1.1) | 2.5 | 11.9 (0.7) | 16.8 (1.2) | -4.9 (1.0) | 1.7 | 11.9 (0.8) | 17.1 (1.3) | -5.1 (1.1) |
| [18] Reliably worsened from moderate to severe illness | 0.4 | 12.1 (1.0) | 24.2 (3.4) | -12.2 (3.2) | 0.4 | 12.7 (0.6) | 22.0 (1.0) | -9.3 (1.5) | 0.4 | 12.2 (0.9) | 23.9 (3.3) | -11.7 (3.1) |



Table 4 continued

| MO classification (‘research model’) | Cohort 1 (a): subjects (%) | Cohort 1: HoNOS baseline, mean (SD) | Cohort 1: HoNOS follow-up, mean (SD) | Cohort 1: change score, mean (SD) | Cohort 2 (b): subjects (%) | Cohort 2: HoNOS baseline, mean (SD) | Cohort 2: HoNOS follow-up, mean (SD) | Cohort 2: change score, mean (SD) | Both sites (c): subjects (%) | Both sites: HoNOS baseline, mean (SD) | Both sites: HoNOS follow-up, mean (SD) | Both sites: change score, mean (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [19] Minimally worsened from mild to severe illness | 1.3 | 8.9 (1.2) | 14.9 (1.2) | -6.0 (1.1) | 0.9 | 9.3 (1.0) | 14.4 (0.5) | -5.1 (1.2) | 1.2 | 9.0 (1.1) | 14.8 (1.1) | -5.9 (1.1) |
| [20] Reliably worsened from mild to severe illness | 3.2 | 5.8 (2.7) | 17.8 (3.4) | -12.0 (3.6) | 1.6 | 7.8 (1.8) | 18.1 (3.7) | -10.3 (2.9) | 2.9 | 6.0 (2.7) | 17.8 (3.4) | -11.8 (3.6) |
| [21] STABLE in severe illness | 12.4 | 17.4 (3.5) | 17.1 (3.4) | 0.3 (1.9) | 31.2 | 18.1 (3.6) | 17.6 (3.6) | 0.5 (1.9) | 15.9 | 17.6 (3.5) | 17.3 (3.5) | 0.4 (1.9) |
| [22] Minimally worsened in severe illness | 1.2 | 18.1 (3.8) | 23.4 (3.9) | -5.3 (1.1) | 3.7 | 16.1 (2.3) | 21.3 (2.3) | -5.2 (1.0) | 1.7 | 17.2 (3.4) | 22.5 (3.5) | -5.3 (1.0) |
| [23] Reliably worsened in severe illness | 0.6 | 16.9 (3.3) | 27.7 (4.8) | -10.8 (3.1) | 1.6 | 16.6 (1.8) | 26.8 (2.9) | -10.2 (2.2) | 0.8 | 16.8 (2.8) | 27.3 (4.2) | -10.6 (2.8) |
| Total | 100.0 | 10.5 (6.2) | 9.2 (6.1) | 1.3 (5.4) | 100.0 | 16.7 (5.4) | 14.3 (5.2) | 2.4 (5.6) | 100.0 | 11.6 (6.5) | 10.1 (6.3) | 1.5 (5.5) |

Classification of the study subjects into 23 outcome categories (n = 4,331)
a Italian ROM study (n = 3,526)
b Dutch ROM study (n = 805)
c Merged sample (n = 4,331)
d Subgroup of ‘‘stable within mild illness’’



patients (categories 1–11 vs categories 12–23; χ² = 98.715, df = 1, p < 0.0001). The ‘clinical model’ gave very similar results (Table 5): 67.5 % of the Italian patients had a positive outcome (categories 1–4) compared to only 46.2 % of the Dutch patients (χ² = 128.198, df = 1, p < 0.0001). Almost half the Italian patients (46.7 %) were in mild illness at follow-up (category 1) compared to only 8.4 % of the Dutch patients. On the other hand, 41.9 % of the Dutch patients were in critical clinical conditions at follow-up (categories 7–8) compared to 20.7 % of the Italian patients (χ² = 159.622, df = 1, p < 0.0001). The mean MO index (‘clinical model’) was 1.87 (SD 2.59) for cohort 1, and 0.16 (SD 2.33) for cohort 2. An independent t test gave a significant difference between the cohorts in the MO index (t = 18.39, df = 1,298.86, p < 0.0001), again indicating a better outcome for cohort 1.

Paired t tests and a repeated-measures ANOVA showed significant improvement for both cohorts. We ran two separate t-test analyses, which gave the following results. Cohort 1: mean Δ 1.3 (SD 5.4); t = 14.0 (df = 3,525), p < 0.0001. Cohort 2: mean Δ 2.4 (SD 5.6); t = 12.2 (df = 804), p < 0.0001. The mixed-group factorial ANOVA (sphericity assumed) showed an interaction between time and group: F(1, 4329) = 28.24; MSE = 14.99 (p < 0.0001). As expected, both cohorts had lower HoNOS ratings at follow-up (improvement), but cohort 2 showed greater improvement: F(1, 4329) = 738.37; MSE = 56.65 (p < 0.0001).
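The paired t statistics above can be cross-checked from summary statistics alone, since for a paired design t equals the mean change divided by its standard error, with n − 1 degrees of freedom. The following is a minimal sketch, assuming the unrounded change-score means and SDs given in the Total row of Table 6; the function name is hypothetical and not part of the original analysis.

```python
import math


def paired_t_from_summary(mean_change: float, sd_change: float, n: int):
    """Return (t, df) for a paired t test given change-score summary statistics."""
    standard_error = sd_change / math.sqrt(n)
    return mean_change / standard_error, n - 1


# Change-score mean, SD and sample size taken from the Total row of Table 6.
t_cohort1, df_cohort1 = paired_t_from_summary(1.2856, 5.44567, 3526)
t_cohort2, df_cohort2 = paired_t_from_summary(2.4224, 5.60832, 805)

print(f"Cohort 1: t = {t_cohort1:.2f}, df = {df_cohort1}")
print(f"Cohort 2: t = {t_cohort2:.2f}, df = {df_cohort2}")
```

Under these assumptions both statistics fall within rounding of the values reported in the text.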

For cohort 1, the group ES was 0.20, mean ES-ind 0.20 (SD 0.85). The correlation between the ordinal variables classifying outcome through the MO model (MO index) and ES-ind was significant at the 0.0001 level (two-tailed) with ρ = 0.21. For cohort 2, the group ES was 0.38, mean ES-ind 0.38 (SD 0.88). The correlation between the MO index and ES-ind was significant at the 0.0001 level (two-tailed) with ρ = 0.66.

RTM analysis

As shown in Table 6, there was evidence of an RTM effect on outcome: low baseline scores tended to increase while high scores tended to drop. Among the subjects with an average baseline score (stratum 2) those of cohort 1 showed a greater change than those of cohort 2.

Fig. 1 Plot of the combined model for the interpretation of meaningful outcome (MO): ‘research model’ (4,331 cases)

Estimation of the RTM effect with Barnett’s formula gave the following results: RTM = 0.58 (cohort 1); RTM = 1.59 (cohort 2). As the mean baseline score of cohort 1 was below the ‘true mean’ while that of cohort 2 was above it, the RTM had opposite effects across the two cohorts. Thus, the estimated mean change scores corrected for RTM were 1.3 + 0.58 = 1.88 and 2.4 − 1.59 = 0.81, respectively.

The non-equivalent group analysis involved two ANCOVA models. The first estimated the average difference in outcome between the two cohorts adjusted for baseline severity. There was a significant advantage for cohort 1 in terms of outcome, with a mean difference between the two cohorts of 1.51 points on HoNOS total score (95 % CI 1.11–1.91; p < 0.0001). The second analysis estimated the average difference between the two cohorts adjusted for baseline severity and for measurement error (reliability-adjusted ANCOVA). There was further gain in the advantage of cohort 1, with a mean difference of 2.11 points (95 % CI 1.72–2.50).

Discussion

The substantial heterogeneity between the two study cohorts

created the opportunity to find criteria for MO generalizable

to a broad range of service users. We addressed the issue of

generalizability by taking account of both between- and

within-country variability. In particular, we combined Italian

and Dutch data in order to simulate the presence of specific

patients’ subgroups within a large and heterogeneous patient

population. The procedure for identifying common criteria

across different language versions of HoNOS is original and

could be used for other rating scales (with the same calcu-

lation tables we generated).

The MO model was presented in two versions, the first generating a more complete and complex classification (the ‘research model’) and the second generating a shorter and workable classification (the ‘clinical model’). We compared the results of both models with those calculated using consolidated approaches for interpreting change. Unlike previous work, the present paper focused more on ‘outcome’ than on ‘change’ and also took account of minimally changed and unchanged subjects [30, 32]. The combined model we proposed is thus the result of a combination of two criteria for adequate change (SEM/RC), two criteria for clinically significant change (cut-offs 1 and 2), and two criteria for classifying severity within clinical stability.

Fig. 2 Plot of the combined model for the interpretation of meaningful outcome (MO): ‘clinical model’ (4,331 cases)

Table 5 Combined model for the interpretation of meaningful outcome (MO): simplified classification for clinical use

| MO classification (‘clinical model’) | Cohort 1 (a): subjects (%) | Cohort 1: HoNOS baseline, mean (SD) | Cohort 1: HoNOS follow-up, mean (SD) | Cohort 1: change score, mean (SD) | Cohort 2 (b): subjects (%) | Cohort 2: HoNOS baseline, mean (SD) | Cohort 2: HoNOS follow-up, mean (SD) | Cohort 2: change score, mean (SD) | Both sites (c): subjects (%) | Both sites: HoNOS baseline, mean (SD) | Both sites: HoNOS follow-up, mean (SD) | Both sites: change score, mean (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [1] Mild illness | 46.7 | 5.7 (2.7) | 4.8 (2.9) | 0.9 (2.9) | 8.4 | 8.1 (1.8) | 7.6 (2.4) | 0.6 (2.5) | 39.6 | 5.8 (2.7) | 4.9 (2.9) | 0.9 (2.9) |
| [2] Improvement to mild illness | 14.0 | 15.1 (3.6) | 6.1 (2.7) | 8.9 (4.2) | 14.9 | 16.8 (3.8) | 12.0 (0.8) | 4.8 (4.1) | 14.2 | 15.4 (3.7) | 6.4 (2.7) | 8.9 (4.2) |
| [3] Improvement to moderate illness | 3.5 | 19.1 (3.4) | 11.9 (0.8) | 7.2 (3.1) | 9.7 | 20.0 (3.3) | 12.0 (0.8) | 8.0 (3.1) | 4.6 | 19.5 (3.4) | 12.0 (0.8) | 7.5 (3.2) |
| [4] Improvement within severe illness | 3.3 | 23.0 (3.7) | 16.5 (2.5) | 6.4 (2.7) | 13.2 | 23.6 (3.9) | 16.7 (2.9) | 6.8 (3.1) | 5.1 | 23.3 (3.8) | 16.6 (2.7) | 6.6 (2.9) |
| [5] Stability in moderate illness | 8.8 | 12.2 (1.7) | 12.0 (1.6) | 0.2 (1.9) | 11.2 | 12.5 (1.8) | 12.5 (1.7) | 0.0 (1.9) | 9.3 | 12.3 (1.7) | 12.1 (1.7) | 0.2 (1.9) |
| [6] Worsening to moderate illness | 3.0 | 6.0 (2.0) | 12.1 (0.7) | -6.1 (1.8) | 0.7 | 7.8 (1.2) | 12.8 (0.4) | -5.0 (1.1) | 2.6 | 6.1 (2.0) | 12.1 (0.8) | -6.0 (1.8) |
| [7] Stability in severe illness | 12.4 | 18.5 (3.4) | 18.1 (3.2) | 0.3 (1.9) | 31.2 | 19.1 (3.4) | 18.6 (3.4) | 0.5 (1.9) | 15.9 | 18.7 (3.4) | 18.3 (3.3) | 0.4 (1.9) |
| [8] Worsening in/to severe illness | 8.3 | 10.3 (5.2) | 19.1 (4.8) | -8.7 (4.1) | 10.7 | 13.3 (3.7) | 20.1 (4.3) | -6.8 (2.9) | 8.7 | 11.0 (5.0) | 19.3 (4.7) | -8.3 (4.0) |
| Total | 100.0 | 10.5 (6.2) | 9.2 (6.1) | 1.3 (5.4) | 100.0 | 16.7 (5.4) | 14.3 (5.2) | 2.4 (5.6) | 100.0 | 11.6 (6.5) | 10.1 (6.3) | 1.5 (5.5) |

Classification of the study subjects into eight outcome categories (4,331 cases)
a Italian ROM study (n = 3,526)
b Dutch ROM study (n = 805)
c Merged sample (n = 4,331)
d Subgroup of ‘‘stable within mild illness’’

The calculated criteria for adequate change are highly

consistent with previous findings [22, 30, 32]. Thus, the

study adds further evidence in favour of the use of a

change of at least 4 points on HoNOS for minimally

detectable change and of at least 8 points for reliable

change which corresponded to a medium and large effect

size. In the ‘research model’ we proposed two levels of

change. We confirmed that a change of at least 8 points is

needed for an individual reliable change [30, 32] but we

proposed also using a change of at least 4 points to detect

minimal change. This was done to achieve a more com-

plete classification and to introduce a middle ground

between the considerable change needed for the RC cri-

terion and the no-change status. In the ‘clinical model’

we opted for a simplification of the classification system.

Thus, the 23 outcome categories were collapsed into

eight. The patients who showed a change between 4 and 8

points (i.e. the ‘minimally changed’ subjects) were

included in the category of ‘meaningful change’ and not

in the category of ‘no change’ as in previous work [30,

32]. This choice was mainly due to the high proportion of

unchanged subjects (56 %). As a result, in the ‘clinical

model’ only the SEM criterion (which obviously included

also the RC criterion) was adopted to identify meaningful

change.
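The collapsing of outcomes into the eight ‘clinical model’ categories can be sketched as a small decision rule. This is an illustrative sketch only, under assumptions that the text does not state explicitly: the severity bands (mild up to 10, moderate 11 to 13, severe 14 and above) are inferred from the category means in Tables 4 and 5, subjects who change by fewer than 4 points are assigned by their follow-up band, and the function names are hypothetical.

```python
def severity(score: int) -> str:
    """Severity band for a HoNOS total score (band edges are assumptions)."""
    if score <= 10:
        return "mild"
    if score <= 13:
        return "moderate"
    return "severe"


CLINICAL_LABELS = {
    1: "Mild illness",
    2: "Improvement to mild illness",
    3: "Improvement to moderate illness",
    4: "Improvement within severe illness",
    5: "Stability in moderate illness",
    6: "Worsening to moderate illness",
    7: "Stability in severe illness",
    8: "Worsening in/to severe illness",
}


def classify_clinical(baseline: int, follow_up: int, min_change: int = 4) -> int:
    """Return the clinical-model category (1-8) for one subject.

    A positive change (baseline minus follow-up) means improvement, because
    lower HoNOS scores indicate less severe illness.
    """
    change = baseline - follow_up
    before, after = severity(baseline), severity(follow_up)

    if abs(change) < min_change:
        # No meaningful change: classify by follow-up severity (assumption).
        return {"mild": 1, "moderate": 5, "severe": 7}[after]
    if change > 0:
        # Meaningful improvement.
        if after == "mild":
            return 1 if before == "mild" else 2
        return 3 if after == "moderate" else 4
    # Meaningful worsening.
    return {"mild": 1, "moderate": 6, "severe": 8}[after]


for pair in [(18, 6), (12, 12), (6, 12), (20, 19)]:
    print(pair, CLINICAL_LABELS[classify_clinical(*pair)])
```

Any meaningful change that starts and ends within mild illness is folded into category 1, which matches the way the category percentages in Table 5 add up from those in Table 4.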

The outcome of cohort 1, assessed through the MO

models, was better than that of cohort 2. This was to be

expected, since the Italian patients came from a prevalence

cohort receiving regular outpatient services whereas the

Dutch patients were referred to ACT teams and selected

based on their severity and the lack of motivation for

regular treatment. The classical approaches for outcome

evaluation, however, gave opposite results and showed

greater improvement for cohort 2. This was largely

explained through the RTM, which showed a large oppo-

site effect across the two cohorts. The two ANCOVAs, in

fact, reversed the results again showing outcome advanta-

ges for cohort 1 of 1.5 and of 2.1 points.

RTM is a statistical phenomenon that occurs when

repeated measurements are made on the same subjects. It

happens because values are observed with random error,

because sampling is subject to selection and because of the

unreliability of the measures. In our cohorts it was mostly

due to a selection bias which is common in observational

studies and difficult to get around. RTM is a critical issue in

ROM where it is hardly practical to use complex statistical

procedures to adjust for it. The proposed approach for

change interpretation is very simple and gave results com-

parable to those of ANCOVA. Unlike the other reference

methods, it was scarcely influenced by the RTM effect. In

cohort 1, the great majority of the unchanged subjects

(69 %) were reclassified in a positive outcome category

(mild illness). In cohort 2, on the contrary, most of them

(83 %) were reclassified in two negative categories (stable

in moderate or in severe illness). This led the outcome of

cohort 1 to be better than that of cohort 2. The compensation

introduced by this reclassification is opposite and propor-

tional to the RTM. Thus, we believe that the scarce influ-

ence of the RTM on our model was a result of it.

Besides this important issue, there is another problem in using the classical approaches for evaluating group-level outcome in community mental health care. Within the wider groups of routine patients there are substantial subgroups suffering from chronic conditions for whom the maintenance of a non-severe, stable health status remains a valid therapeutic goal [8]. In such cases the statistical significance of improvement can only be a function of sample size and not of change size. On the other hand, the more conservative methods based on individual meaningful difference (i.e. RC, SEM or ES) inevitably show no measurable change in those subgroups and are thus impossible to interpret (as classification is lacking). The advocated MO approach gave a better description of the outcomes assessed by HoNOS and allowed a more adequate comparison of the two quite different cohorts of patients.

Table 6 HoNOS change scores in subjects stratified based on their baseline score (4,331 cases)

| Strata | Cohort 1 (a): mean | Cohort 1: SD | Cohort 1: no. | Cohort 1: % | Cohort 2 (b): mean | Cohort 2: SD | Cohort 2: no. | Cohort 2: % |
|---|---|---|---|---|---|---|---|---|
| 1 (c) | -1.4722 | 4.26663 | 831 | 23.6 | -2.5000 | 4.81070 | 8 | 1.0 |
| 2 (d) | 1.4519 | 4.96417 | 2,295 | 65.1 | 0.6436 | 4.82138 | 505 | 62.7 |
| 3 (e) | 6.0606 | 6.60195 | 400 | 11.3 | 5.6336 | 5.43875 | 292 | 36.3 |
| Total | 1.2856 | 5.44567 | 3,526 | 100.0 | 2.4224 | 5.60832 | 805 | 100.0 |

The mean and standard deviation (SD) of the aggregate sample (i.e. both sites) were used for stratification
a Italian ROM study
b Dutch ROM study
c Low < (mean − 1 SD)
d (Mean − 1 SD) < Average < (mean + 1 SD)
e High > (mean + 1 SD)
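The baseline stratification used in Table 6 can be sketched as follows. This is a minimal sketch, assuming the aggregate baseline mean (11.6) and SD (6.5) reported for both sites in the Total rows of Tables 4 and 5; the function name is hypothetical, and because the table notes use strict inequalities, a score falling exactly on a boundary is assigned here to the average stratum by assumption.

```python
def baseline_stratum(score: float, mean: float = 11.6, sd: float = 6.5) -> int:
    """Return 1 (low), 2 (average) or 3 (high) for a baseline HoNOS total score."""
    if score < mean - sd:
        return 1  # low: below mean - 1 SD
    if score > mean + sd:
        return 3  # high: above mean + 1 SD
    return 2      # average: within one SD of the mean


for s in (5, 6, 18, 19):
    print(s, baseline_stratum(s))
```

With these rounded inputs the cut points are about 5.1 and 18.1, so integer HoNOS totals never fall exactly on a boundary.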

Limitations

The present study suffers from several limitations. First, although it is acknowledged by the Australian Mental Health Outcomes & Classification Network, the classify-and-count approach adopted is based on an arbitrary classification [6, 30]. However, the two resulting CS cut-offs are consistent with previous results [30, 32]. In particular, the cut-off of 10 separating moderate from mild illness is consistent with the results of a recent study that calculated the CS threshold through a more valid and acceptable approach [32]. Thus, our results can still be considered valid and generalizable.

Second, the positions of the three subgroups of stable

patients in the classification hierarchy are also the result of

an arbitrary choice. Third, there are no reference data on

HoNOS rating scores, i.e. how severely mentally ill

patients progress over time. So the reference (whether the

outcome is positive or not) currently involves the clini-

cian’s judgment and the patient’s treatment goals. Fourth,

the MO approach, as conceived in this study, is only

applicable to HoNOS total scores. Thus, information on

changes in the individual items or subscales is lost.

Fifth, the patients to whom the model was applied were

all outpatients. It is therefore unknown whether our find-

ings could be applicable to inpatients or to residential

services. In the first case, as patients in community-based

mental health services are usually admitted to hospital in

the acute phase we would expect to select subjects at their

worst. Their margin for improvement should therefore be

wider and the RTM dramatically more evident. In the

second case, on the contrary, we would expect to recruit

stabilized patients whose clinical assessment would likely

be affected by floor effect. In both cases the application of

our model could produce different results.

Sixth, as we aggregated data across two very different

clinical samples, from different countries, with different

outcome measurement traditions, and as the non-equivalent

group analysis approach could deal only with a defined set

of non-equivalences, we are not able to know to what

extent our model is able to cope with the differences we

could not account for.

Implications

It is not sufficient to look at the ES or at the change scores for proper interpretation of pre-post HoNOS data. The proposed approach proved valuable for interpreting outcome at the group level. It is simple and practical, so it could be adopted for routine outcome evaluation and management. At the patient level the ‘clinical model’ could be used to measure and monitor treatment effect and guide progress. It could also serve as feedback information for clinicians and patients. Clinicians could be trained to use this evidence-based information to make clinical decisions, to help decide whether to continue a treatment as it is or even to end it if appropriate (when a patient shows prolonged stability in mild illness). When the outcome is negative (stable in severe illness) clinicians could consider changing the treatment plan [47]. This very simple and practical approach may not only help in tailoring interventions, but may also improve communication between patients and clinicians and discussion about treatment and its progress [48].

Conflict of interest The authors declare no conflict of interest.

Appendix

The standard error of a measurement (SEM) was calculated as:

$$\mathrm{SEM} = SD_1 \times \sqrt{1 - \alpha}$$

(where $SD_1$ is the standard deviation of the baseline observations and $\alpha$ is Cronbach’s coefficient).

$$\mathrm{SEM}_{\mathrm{diff}} = \sqrt{2 \times \mathrm{SEM}^2}$$

The RC index was calculated as:

$$\mathrm{RC}_{90\,\%} = 1.65 \times \mathrm{SEM}_{\mathrm{diff}}$$
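The three formulas above chain together directly. The sketch below is illustrative only: the baseline SD of 6.5 is the merged-sample value from the Total rows of Tables 4 and 5, whereas the Cronbach's alpha of 0.75 is a purely hypothetical input chosen for the example and is not a value reported here; the function names are likewise hypothetical.

```python
import math


def sem(sd_baseline: float, alpha: float) -> float:
    """Standard error of measurement: SD1 * sqrt(1 - alpha)."""
    return sd_baseline * math.sqrt(1.0 - alpha)


def sem_diff(sem_value: float) -> float:
    """Standard error of the difference between two measurements."""
    return math.sqrt(2.0 * sem_value ** 2)


def rc_90(sd_baseline: float, alpha: float) -> float:
    """Reliable change threshold at the 90 % level: 1.65 * SEM_diff."""
    return 1.65 * sem_diff(sem(sd_baseline, alpha))


# alpha = 0.75 is a hypothetical reliability used only for illustration.
example_sem = sem(6.5, 0.75)
example_rc = rc_90(6.5, 0.75)
print(f"SEM = {example_sem:.2f}, RC(90 %) = {example_rc:.2f}")
```

A lower reliability inflates both thresholds, so the number of points needed for a reliable change depends on the alpha estimated for the scale.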

Effect size (ES): Cohen’s d was calculated as the difference between two means divided by a pooled standard deviation.

$$ES = \frac{\bar{x}_1 - \bar{x}_2}{S_{\mathrm{pooled}}}$$

The original standard deviations for the two means (baseline = 1; follow-up = 2) are used to compute the pooled standard deviation.

$$S_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2}}$$

Cohen’s rule of thumb was used to interpret results:

ES < 0.2 → no change; 0.2 < ES < 0.5 → small change; 0.5 < ES < 0.8 → moderate change; ES > 0.8 → great change.

We considered $\bar{x}_1 - \bar{x}_2 = \Delta_{ES}$ (where Δ is the degree of change needed to classify change as small, moderate or great according to ES) and we calculated:

$$\Delta_{ES,\ \mathrm{small\ change}} = 0.2 \times S_{\mathrm{pooled}}$$

$$\Delta_{ES,\ \mathrm{moderate\ change}} = 0.5 \times S_{\mathrm{pooled}}$$

$$\Delta_{ES,\ \mathrm{great\ change}} = 0.8 \times S_{\mathrm{pooled}}$$

The individual effect size was calculated as the difference between baseline and follow-up scores of each individual divided by the pooled standard deviation.

$$\mathrm{indES} = \frac{x_1 - x_2}{S_{\mathrm{pooled}}}$$

According to the results of this calculation subjects were classified as ‘‘unchanged’’ (indES < 0.2), ‘‘weakly changed’’ (0.2 < indES < 0.5), ‘‘moderately changed’’ (0.5 < indES < 0.8), and ‘‘greatly changed’’ (indES > 0.8). This classification was applied in the directions of improvement and worsening.
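The effect-size calculations can be sketched numerically. The example below rests on an assumption that is inferred rather than stated: the pooled SD is computed from the merged-sample baseline and follow-up SDs (6.5 and 6.3, n = 4,331, from the Total rows of Tables 4 and 5), a choice that reproduces the group ES values of 0.20 and 0.38 reported in the Results. The pooled-SD denominator follows the appendix formula as printed, the assignment of boundary values to the higher class is an assumption, and all function names are hypothetical.

```python
import math


def pooled_sd(s1: float, n1: int, s2: float, n2: int) -> float:
    """Pooled SD with the n1 + n2 denominator used in the appendix formula."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2))


def effect_size(change: float, s_pooled: float) -> float:
    """Group or individual effect size: change divided by the pooled SD."""
    return change / s_pooled


def classify_ind_es(ind_es: float) -> str:
    """Classify an individual ES by magnitude; sign gives the direction."""
    size = abs(ind_es)
    if size < 0.2:
        return "unchanged"
    if size < 0.5:
        label = "weakly changed"
    elif size < 0.8:
        label = "moderately changed"
    else:
        label = "greatly changed"
    direction = "improved" if ind_es > 0 else "worsened"
    return f"{label} ({direction})"


# Merged-sample SDs (assumption about which SDs were pooled).
s_pooled = pooled_sd(6.5, 4331, 6.3, 4331)
es_cohort1 = effect_size(1.2856, s_pooled)  # mean change from Table 6
es_cohort2 = effect_size(2.4224, s_pooled)
delta_thresholds = {k: k * s_pooled for k in (0.2, 0.5, 0.8)}

print(f"S_pooled = {s_pooled:.2f}")
print(f"ES cohort 1 = {es_cohort1:.2f}, ES cohort 2 = {es_cohort2:.2f}")
print(classify_ind_es(effect_size(4, s_pooled)))
```

Under these assumptions the change needed for a small, moderate or great effect works out to roughly 1.3, 3.2 and 5.1 HoNOS points.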

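The CS cut-off for clinical significance was calculated as:

$$\text{CS cut-off} = \frac{(\mathrm{mean}_{\mathrm{clin}} \times SD_{\mathrm{norm}}) + (\mathrm{mean}_{\mathrm{norm}} \times SD_{\mathrm{clin}})}{SD_{\mathrm{norm}} + SD_{\mathrm{clin}}}$$

(where $\mathrm{mean}_{\mathrm{clin}}$ and $\mathrm{mean}_{\mathrm{norm}}$ are the mean scores of the ‘dysfunctional population’ and the ‘functional population’, respectively, and $SD_{\mathrm{norm}}$ and $SD_{\mathrm{clin}}$ are the standard deviations of the scores in these two groups)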
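The cut-off is a weighted midpoint between the two population means, with each mean weighted by the other group's SD. The sketch below uses purely hypothetical means and SDs to show the behaviour; none of the numbers are values from the study, and the function name is an assumption.

```python
def cs_cutoff(mean_clin: float, sd_clin: float,
              mean_norm: float, sd_norm: float) -> float:
    """Clinical-significance cut-off between a clinical and a normative group."""
    return (mean_clin * sd_norm + mean_norm * sd_clin) / (sd_norm + sd_clin)


# Hypothetical inputs for illustration only.
equal_spread = cs_cutoff(mean_clin=20.0, sd_clin=5.0, mean_norm=10.0, sd_norm=5.0)
unequal_spread = cs_cutoff(mean_clin=16.0, sd_clin=6.0, mean_norm=4.0, sd_norm=2.0)

print(equal_spread)    # midpoint of the two means when the SDs are equal
print(unequal_spread)  # pulled towards the group with the smaller SD
```

When the two SDs are equal the cut-off is simply the midpoint of the means; otherwise it shifts towards the mean of the less variable group.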

$$\text{RTM effect} = \frac{\sigma_w^2}{\sqrt{\sigma_w^2 + \sigma_b^2}}\, C(z) = \sigma_t (1 - \rho)\, C(z), \qquad -1 \le \rho \le 1$$

The total variance is $\sigma_t^2 = \sigma_w^2 + \sigma_b^2$; the within-subject variance is $\sigma_w^2 = (1 - \rho)\,\sigma_t^2$, the between-subject variance is $\sigma_b^2 = \rho\,\sigma_t^2$, and $\rho$ is the correlation between the two measurements.

$$C(z) = \frac{\phi(z)}{1 - \Phi(z)}$$

The terms $\phi(z)$ and $\Phi(z)$ are respectively the probability density and the cumulative distribution functions of the standard normal distribution. $z = (C - \mu)/\sigma_t$ if the subjects are selected using a baseline measurement greater than C, and $z = (\mu - C)/\sigma_t$ if the subjects are selected using a baseline measurement less than C; $\mu$ is the population mean (see footnote 1).
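The RTM formula can be evaluated with the standard library alone. The sketch below is a hedged illustration: C(z) is taken as the inverse Mills ratio φ(z)/(1 − Φ(z)), which is the expected standardized excess of a normal variable selected above a cut-off and the form given by Barnett et al. [43]; all numeric inputs are hypothetical, and the function names are not from the original analysis.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rtm_effect(sigma_t: float, rho: float, cutoff: float, mu: float,
               selected_above: bool = True) -> float:
    """Expected regression-to-the-mean effect: sigma_t * (1 - rho) * C(z)."""
    if selected_above:
        z = (cutoff - mu) / sigma_t
    else:
        z = (mu - cutoff) / sigma_t
    c_z = normal_pdf(z) / (1.0 - normal_cdf(z))  # undefined for very large z
    return sigma_t * (1.0 - rho) * c_z


# Hypothetical inputs: total SD 6.5, test-retest correlation 0.6,
# population mean 11.6, subjects selected with baseline above 14.
example_rtm = rtm_effect(sigma_t=6.5, rho=0.6, cutoff=14.0, mu=11.6)
print(f"Expected RTM effect: {example_rtm:.2f} points")
```

The effect vanishes when the two measurements are perfectly correlated (ρ = 1) and grows as the selection cut-off moves further from the population mean. As in the Results, the estimate is added to or subtracted from the observed mean change depending on whether the group's baseline lies below or above the population mean.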

References

1. Department of Health (2001) Mental health information strategy. Department of Health, London
2. Commonwealth Department of Health and Ageing (2002) National outcome and casemix collection: overview of clinical measures and data items. Commonwealth Department of Health and Ageing, Canberra
3. Ministry of Health (2005) National mental health information strategy 2005–2010. Ministry of Health, Wellington
4. Trauer T (2010) Outcome measurement around the world. In: Trauer T (ed) Outcome measurement in mental health. Cambridge University Press, Cambridge, pp 13–101
5. Norman GR, Sridhar FG, Guyatt GH (2001) Relation of distribution- and anchor-based approaches in interpretation of changes in health-related quality of life. Med Care 39:1039–1047
6. Burgess P, Pirkis J (2008) Key performance indicators for Australian Public Mental Health Services: potential contributions of MH-NOCC data: developing indicators of effectiveness Version 2.0, Brisbane, Queensland
7. Eisen SV, Ranganathan G, Seal P, Spiro A 3rd (2007) Measuring clinically meaningful change following mental health treatment. J Behav Health Serv Res 34:272–289
8. Trauer T (2010) Assessment of change in outcome measurement. In: Trauer T (ed) Outcome measurement in mental health. Cambridge University Press, Cambridge, pp 206–218
9. Speer DC (1999) What is the role of two-wave designs in clinical research? Comment on Hageman and Arrindell. Behav Res Ther 37:1203–1210
10. Cella D, Bullinger M, Scott C, Barofsky I et al (2002) Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life. Mayo Clin Proc 77:384–392
11. Lydick E, Epstein RS (1993) Interpretation of quality of life changes. Qual Life Res 2:221–226
12. Crosby RD, Kolotkin RL, Williams GR (2003) Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol 56:395–407
13. Jacobson NS, Follette WC, Revenstorf D, Baucom DH et al (1984) Variability in outcome and clinical significance of behavioral marital therapy: a reanalysis of outcome data. J Consult Clin Psychol 52:497–504
14. Christensen L, Mendoza JL (1986) A method of assessing change in a single subject: an alteration of the RC index. Behav Ther 17:305–308
15. Jacobson NS, Truax P (1991) Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol 59:12–19
16. Hageman WJ, Arrindell WA (1993) A further refinement of the reliable change (RC) index by improving the pre-post difference score: introducing RCID. Behav Res Ther 31:693–700
17. Evans C, Margison F, Barkham M (1998) The contribution of reliable and clinically significant change methods to evidence-based mental health. Evid Based Ment Health 1:70–72
18. Hageman WJ, Arrindell WA (1999) Establishing clinically significant change: increment of precision and the distinction between individual and group level of analysis. Behav Res Ther 37:1169–1193
19. Hsu LM (1999) A comparison of three methods of identifying reliable and clinically significant client changes: commentary on Hageman and Arrindell. Behav Res Ther 37:1195–1202
20. Hageman WJ, Arrindell WA (1999) Clinically significant and practical! Enhancing precision does make a difference. Reply to McGlinchey and Jacobson, Hsu, and Speer. Behav Res Ther 37:1219–1233

1 The population mean should be a known true mean in a population.

In the present study, the mean of the aggregate sample (i.e. both sites)

was used as an estimate of l.



21. McGlinchey JB, Atkins DC, Jacobson NS (2002) Clinical significance methods: which one to use and how useful are they? Behav Ther 33:529–550
22. Burgess P, Pirkis J, Coombs T (2009) Modelling candidate effectiveness indicators for mental health services. Aust N Z J Psychiatry 43:531–538
23. Cohen J (1969) Statistical power analysis for the behavioural sciences. Academic Press, London
24. Kazis LE, Anderson JJ, Meenan RF (1989) Effect sizes for interpreting changes in health status. Med Care 27(Suppl 3):178–189
25. McHorney CA, Tarlov AR (1995) Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res 4:293–307
26. Grundy CT, Lambert MJ, Grundy EM (1996) Assessing clinical significance: application to the Hamilton Rating Scale for depression. J Ment Health 5:25–33
27. Hafkenscheid A (2000) Psychometric measures of individual change: an empirical comparison with the Brief Psychiatric Rating Scale (BPRS). Acta Psychiatr Scand 101:235–242
28. Audin K, Margison FR, Clark JM, Barkham M (2001) Value of HoNOS in assessing patient change in NHS psychotherapy and psychological treatment services. Br J Psychiatry 178:561–566
29. Matthey S (2004) Calculating clinically significant change in postnatal depression studies using the Edinburgh Postnatal Depression Scale. J Affect Disord 78:269–272
30. Parabiaghi A, Barbato A, D’Avanzo B, Erlicher A et al (2005) Assessing reliable and clinically significant change on Health of the Nation Outcome Scales: method for displaying longitudinal data. Aust N Z J Psychiatry 39:719–725
31. Barbato A, Parabiaghi A, Panicali F, Battino N et al (2011) Do patients improve after short psychiatric admission? A cohort study in Italy. Nord J Psychiatry 65:251–258
32. Parabiaghi A, Rapisarda F, D’Avanzo B, Erlicher A et al (2011) Measuring clinical change in routine mental health care: differences between first time and longer term service users. Aust N Z J Psychiatry 45:558–568
33. Bauer S, Lambert MJ, Nielsen SL (2003) Clinical significance methods: a comparison of statistical techniques. J Pers Assess 82:60–70
34. Speer DC (1992) Clinically significant change: Jacobson & Truax (1991) revisited. J Consult Clin Psychol 60:402–408
35. Kortrijk HE, Staring AB, van Baars AW, Mulder CL (2010) Involuntary admission may support treatment outcome and motivation in patients receiving assertive community treatment. Soc Psychiatry Psychiatr Epidemiol 45:245–252
36. Kortrijk HE, Mulder CL, Drukker M, Wiersma D, Duivenvoorden HJ (2012) Duration of assertive community treatment and the interpretation of routine outcome data. Aust N Z J Psychiatry 46:240–248
37. Wing JK, Beevor AS, Curtis RH, Park SB, Hadden S, Burns A (1998) Health of the Nation Outcome Scales (HoNOS). Research and development. Br J Psychiatry 172:11–18
38. Pirkis JE, Burgess PM, Kirk PK, Dodson S, Coombs TJ, Williamson MK (2005) A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health Qual Life Outcomes 3:76
39. Rossi R, Blaco R, Castelli C, Civenti G et al (1999) Il costo dei pazienti psichiatrici per classi di gravità. Epidemiol Psichiatr Soc 7:198–208
40. Lora A, Bai G, Bianchi S, Bolongaro G et al (2001) La versione italiana della HoNOS (Health of the Nation Outcome Scales), una scala per la valutazione della gravità e dell’esito nei servizi di salute mentale. Epidemiol Psichiatr Soc 10:198–212
41. Mulder CL, Staring ABP, Loos J, Buwalda V et al (2004) De Health of the Nations Outcome Scales (HoNOS) als instrument voor routine outcome assessment. Tijdschr Psychiatr 46:273–285
42. Lelliott P (1999) Definition of severe mental illness. In: Charlwood P, Mason A, Goldacre M, Cleary R, Wilkinson E (eds) Health outcome indicators: severe mental illness. Report of a working group to the Department of Health. National Centre for Health Outcomes Development, Oxford, pp 87–93
43. Barnett AG, Van Der Pols JC, Dobson AJ (2005) Regression to the mean: what it is and how to deal with it. Int J Epidemiol 34:215–220
44. Gardner MJ, Heady JA (1973) Some effects of within-person variability. J Chronic Dis 26:781–795
45. Davis CE (1976) The effect of regression to the mean in epidemiologic and clinical studies. Am J Epidemiol 104:493–498
46. http://www.socialresearchmethods.net/kb/regrmean.php. Accessed April 2013
47. Harmon C, Hawkins EJ, Lambert MJ, Slade K, Whipple JS (2005) Improving outcomes for poorly responding clients: the use of clinical support tools and feedback to clients. J Clin Psychol 61:175–185
48. Drukker M, Bak M, a Campo J, Driessen G, van Os J, Delespaul P (2010) The cumulative needs for care monitor: a unique monitoring system in the south of the Netherlands. Soc Psychiatry Psychiatr Epidemiol 45:475–485

