Reliability of Three-Dimensional Facial Landmarks Using Multivariate Intraclass Correlation
Abstract
The intraclass correlation coefficient (ICC) is widely used in many fields, including orthodontics,
as a measure of reliability of quantitative data. The main goals of this study were to determine
the multivariate intraclass correlation, a measure of intra-rater reliability, for the selection of 6
3D mid-facial soft tissue landmarks and compare the multivariate ICC with other measures of
landmark reliability. 3D stereophotogrammetric images of 15 randomly selected subjects were
landmarked twice by the same rater using 3dMD software. The multivariate ICC was found to
be as high as or higher than the individual coordinate measures in each of the landmarks
examined in this study. The reliability of all six soft-tissue landmarks in this study was excellent
(ICC >0.98). When the within-coordinate and Euclidean distance criteria were applied, the
reliability of the landmarks was found to be acceptable. However, the two selections of the
landmark pronasion were found to be significantly different in the X coordinate. The
multivariate ICC identified this same landmark as its lowest estimate of ICC. This indicates that
the multivariate ICC may be important in quantifying reliability and warrants investigation in
future craniofacial landmark reliability studies.
Introduction
The intraclass correlation coefficient is widely used in many fields, including orthodontics, as a
measure of reliability of quantitative data. With the availability of technical software capable of
recording two and three-dimensional landmark coordinate data for cone beam computerized
tomography (CBCT) and digital cephalometric data, the intraclass correlation coefficient is used
to establish intra- and inter-rater reliability or reliability of landmarks across software or
machines. The intraclass correlation has recently been used to assess the reliability of landmark
placement in studies of soft tissue landmarks using CBCT (Fourie, 2010), hard
tissue landmarks using CBCT (Lagravère, 2009), and cephalometry (Chien, 2009). The most
commonly used method of reporting reliability is a summary measure of the distribution (mean
or range) of intraclass correlation coefficients calculated within each coordinate. The estimates
of the intraclass correlation coefficient are often obtained using a one-way or two-way random
effects ANOVA model (Fisher, 1958; Haggard, 1958; Shrout and Fleiss, 1979) from which
confidence intervals and significance values can also be computed.
However, statistical methods for calculating a multivariate estimate of intraclass correlation
exist and have been used extensively in the fields of biology, epidemiology, and genetics to
estimate the degree of resemblance between members of a family with respect to several
characteristics (Mian, 1997). In the study of reliability of craniofacial landmark selection, an
estimate of multivariate intraclass correlation would represent how closely a “family” of
measures (usually 2 measures) resemble each other with respect to the X, Y, and Z coordinate
information. The unified estimate across coordinates enables a global measure, rather than a
summary of multiple estimates, on which to assess reliability of landmark selection.
A measure of multivariate intraclass correlation was first introduced by Rao (1945, 1953) in
which asymptotic distribution and methods of significance testing were established in the case
that each group had the same number of members. This early work was expanded by Donner
and Koval (1980) who proposed an analysis of variance estimator. Srivastava (1984) developed
non-iterative methods based on weighted sums of squares to obtain unbiased estimators of
covariance matrices for a single characteristic and their asymptotic properties for the case in
which groups have a different number of members. Other estimates based on maximum
likelihood methods have been proposed (Rosner, 1977); however, simulation studies have shown
that inferences based on non-iterative methods perform as well as or
better than those based on maximum likelihood estimators (Rosner, 1979; Konishi,
1982, 1985; Donner and Eliasziw, 1988). Non-iterative approaches are free of the possible
pitfalls of methods of estimation based on the maximization of implicit functions of multiple
parameters, which are sensitive to local extrema and not guaranteed to converge. Konishi and
Khatri (1991) generalized Srivastava’s estimator and the Donner and Koval (1980) estimator to
the multivariate situation of more than one characteristic and proposed a unified estimator
based on the maximum canonical correlation (interclass) and eigenvalue (intraclass) of the
covariance matrix as measures of degree of resemblance (Konishi, 1991). This method assumes
a multivariate normal distribution for the asymptotic properties of the estimates.
The main goals of this study were to determine the multivariate intraclass correlation, a
measure of intra-rater reliability, for the selection of six three-dimensional mid-facial soft tissue
landmarks (nasion, pronasion, right and left alare, labiale superius, and labiale inferius, with
subnasale used for superimposition; see Figure 1), and to compare the multivariate intraclass
correlation coefficient with other measures of landmark agreement. This study is innovative in
the application of these methods of calculating the estimate of multivariate intraclass
correlation coefficient to assess the reliability of 3D landmark selection and in the comparison
of this estimate to conventional within-coordinate intraclass correlation coefficients.
Materials and Methods
Three-dimensional stereophotogrammetric images of 15 randomly selected subjects were
landmarked using 3dMD software (Atlanta, GA) and coordinates of each landmark were
selected twice by the same rater with approximately 2 weeks between landmark selections. Six
facial landmarks were selected for this study because they represent the nasal and mouth areas
of the face, which are pronounced three-dimensional facial features. Coordinates were
exported and the sets of landmarks were superimposed by translating each set so that its first
subnasale selection lay at the origin. The sets of landmarks were then standardized
with respect to rotation (X into Y holding Z constant) and facial tilt (Y into Z holding X constant)
using rotation matrices with the coordinates of the first measure of nasion located on the Y-axis
for all subjects after adjustment (Lele and Richtsmeier, 2001). Standardization of head position
and coordinate location was not done in the imaging process, but it is important in assessing
reliability because it allows the variation of each landmark, which affects the intraclass
correlation coefficient, to be reflected accurately. Therefore, the head position was standardized using rotation matrices
after the coordinates were exported. Conventional within-coordinate intraclass correlation
coefficients, confidence intervals, and significance probabilities, obtained from a one-way
ANOVA model, were computed for the X, Y, and Z planes for each landmark. Since the intraclass
correlation coefficient is a parametric procedure, the residuals from each model were
examined for normality using the Shapiro-Wilk test. Only for left alare in the Y coordinate was
there evidence of non-normality (Shapiro-Wilk p=0.0172). However, since the one-way ANOVA
model is generally robust against minor departures from normality, the use of ICC for this
coordinate was deemed appropriate. The intraclass correlation coefficient gives the proportion
of variance attributable to between-group differences, and the null hypothesis for significance
testing is that this coefficient is equal to zero. The ICC ranges from 0 to 1, with 1 indicating
perfect agreement. A commonly adopted minimum acceptable univariate intraclass correlation
coefficient is 0.80 (Shrout and Fleiss, 1979), with 0.90 generally considered excellent
agreement. However, since the estimate of the intraclass correlation coefficient is subject to sampling
error, Lee (1989) proposes that the lower bound of the 95% confidence interval of the estimate
be at least 0.75. A method of multivariate intraclass correlation (Konishi, 1991) was used to
determine the level of agreement across X, Y, and Z coordinates. Distributional assumptions
were validated using the Shapiro-Wilk test within each dimension. This multivariate method, as
discussed above, produces a unified estimator of intra-rater agreement between the dual
landmark selections with respect to X, Y, and Z coordinate information.
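As an illustration of the within-coordinate coefficient described above, the following Python sketch computes a single-measure ICC from a one-way random-effects ANOVA. It assumes the form ICC = (MSB - MSW) / (MSB + (k - 1) MSW), where MSB and MSW are the between-subject and within-subject mean squares and k is the number of selections per subject. The function name `icc_oneway` and the sample values are hypothetical; this is not the SAS or R code used in the study, and it does not compute confidence intervals or significance probabilities.

```python
from statistics import mean


def icc_oneway(measurements):
    """Single-measure ICC from a one-way random-effects ANOVA.

    `measurements` holds one row per subject; each row holds the k repeated
    selections of one coordinate (k = 2 for dual landmark selections).
    """
    n = len(measurements)          # number of subjects
    k = len(measurements[0])       # selections per subject
    grand_mean = mean(v for row in measurements for v in row)
    row_means = [mean(row) for row in measurements]

    # Between-subject and within-subject sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum(
        (v - m) ** 2 for row, m in zip(measurements, row_means) for v in row
    )

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical X coordinates (mm) for five subjects, two selections each.
example = [(10.0, 10.2), (12.1, 11.9), (15.0, 15.1), (9.5, 9.4), (13.3, 13.6)]
icc_x = icc_oneway(example)
```

In this sketch, each coordinate (X, Y, Z) would be passed separately, giving three coefficients per landmark.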
Lee (1989) proposes a three-pronged approach for establishing acceptable agreement. In
addition to 1) a lower bound of the 95% confidence interval of the intraclass correlation coefficient of
at least 0.75, there must be 2) no systematic bias between the measures, and 3) no significant
differences between the measures. Estimation of intraclass correlation does not distinguish
between systematic and random error. To identify any possible bias and detect significant
differences in the sets of landmarks, the difference was calculated for each set of landmark
coordinates (first minus second) within each plane. The normality of each within-coordinate
difference variable was assessed using the Shapiro-Wilk test under the null hypothesis that the
distribution is normal. Difference variables shown to be normal were tested using the paired t-test
under the null hypothesis that the mean difference between the two measures was equal
to zero. For variables in which the distribution of the difference was shown to be non-normal,
a Wilcoxon signed-rank test was used to determine if the median difference between
the coordinates from the two landmark selections was equal to zero (assuming symmetry).
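The paired t-test described above can be sketched as follows. The statistic is the mean of the within-coordinate differences (first minus second) divided by its standard error, with n - 1 degrees of freedom. The function names and sample values are hypothetical, and the sketch stops at the test statistic: obtaining a p-value requires the Student's t distribution from a statistics package, and the Wilcoxon alternative is not shown.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for paired selections of one coordinate."""
    diffs = [a - b for a, b in zip(first, second)]  # first minus second
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def t_from_summary(mean_diff, sd_diff, n):
    """Same statistic computed from summary values of the differences."""
    return mean_diff / (sd_diff / sqrt(n))


# Hypothetical coordinates (mm) for four subjects.
t_value, df = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.5, 3.0])
```

A negative statistic indicates that the second selection tended to be larger than the first, which is how a systematic bias would appear under the first-minus-second convention.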
To assess differences in landmark placement that may be of clinical importance, the Euclidean
distance between the two sets of 3D coordinates was calculated. Descriptive statistics are
given to summarize the actual distance in space between the two sets of landmarks.
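A minimal Python sketch of this calculation is given below: for each subject, the distance is the square root of the sum of squared differences in X, Y, and Z between the two selections, and the distances are then summarized per landmark. The function names and coordinates are hypothetical and are not taken from the study data.

```python
from math import dist
from statistics import mean, median, stdev


def selection_distances(first, second):
    """Euclidean distance (mm) between paired 3D selections, one per subject."""
    return [dist(p, q) for p, q in zip(first, second)]


def summarize(distances):
    """Descriptive statistics for one landmark's distances."""
    return {
        "n": len(distances),
        "mean": mean(distances),
        "st_dev": stdev(distances),
        "median": median(distances),
        "minimum": min(distances),
        "maximum": max(distances),
    }


# Hypothetical (X, Y, Z) selections for three subjects.
first = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 0.0, 1.0)]
second = [(3.0, 4.0, 0.0), (1.0, 1.0, 3.0), (2.0, 0.0, 1.0)]
summary = summarize(selection_distances(first, second))
```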
Calculation of the multivariate intraclass correlation coefficient is not available in standard
commercially available statistical software. An R program was created by the author to
perform the calculations of the estimates for the multivariate ICC reported in this study and is
available upon request.
Analysis was done using SAS Enterprise Guide 4.2 and R statistical software, with a specified
Type I error of 0.05.
Results
Within-coordinate reliability analysis was performed on the selection of these six soft tissue
landmarks. The intraclass correlation coefficients for the landmarks are shown in Table 1. The
within-coordinate coefficients ranged from 0.9841 in the Y coordinate of left alare, to 0.9999 in
the Y coordinate of nasion and the X coordinate of right alare. No lower bound of the 95%
confidence intervals was below the 0.75 standard proposed by Lee (1989); all lower bounds were
>0.95. In addition, all coordinates of all landmarks had intraclass correlation coefficients significantly
greater than zero (p<0.0001 for all). The mean ICC across all coordinates within a landmark was
also calculated; this value ranged from 0.9912 in left alare to 0.9997 in nasion. It is worth noting
here that the first measure of nasion was a landmark used to standardize the rotation and tilt of
the set of facial landmarks.
The multivariate intraclass correlation (Konishi, 1991), encompassing information for all three
dimensions, was calculated for each landmark. These unified landmark values ranged from
0.9991 for pronasion to 0.9999 for nasion, right alare, and left alare.
The differences between the first and second landmark selections (first minus second) were
calculated for each dimension and each landmark. These differences were tested using the
paired t-test or Wilcoxon signed-rank test to determine whether the mean or median, respectively, was
equal to zero. The p-values for these tests are given in Table 2. A p-value <0.05 indicates
a significant difference between the two landmark selections in that coordinate. A significant
difference occurred between the two selections of pronasion in the X coordinate
(mean=-0.22mm, t-test p=0.0014). Note that no adjustment was made
for multiple comparisons.
The Euclidean distance (in mm) was calculated for each landmark to assess the actual distance
between the two landmark selections in three-dimensional space. The summary of the
Euclidean distances is given in Table 3. The largest mean Euclidean distances were found to
exist in the left alare landmark (mean=0.84mm, st.dev=0.73mm) followed by the labiale
superius landmark (mean=0.78mm, st.dev=0.51mm). Pronasion (mean=0.49mm,
st.dev=0.24mm) and right alare (mean=0.48mm, st.dev=0.29mm) had the smallest mean
Euclidean distances between landmark selections. However, when medians were considered,
the largest distance occurred in the labiale superius landmark (median=0.70mm). The left alare
landmark had the largest variance in the Euclidean distance of the landmarks examined in this
study.
Discussion
This study applied a technique of calculating multivariate intraclass correlation coefficients, a
widely used measure in other scientific fields, to measure the reliability of two sets of 3D soft-tissue
facial landmark selections. The multivariate measure was then compared to the
conventional methods used currently in reliability studies in the field of craniofacial
landmarking, specifically within-coordinate intraclass correlation coefficients and summary
measures by landmark. In contrast with these summary measures of coordinate-specific
intraclass correlations, the multivariate measure was as high as or higher than the individual
coordinate measures in each of the landmarks examined in this study. As the maximum
eigenvalue of the variance-covariance matrix, the multivariate estimate takes into account
the variances of all three dimensions simultaneously as well as their interrelationships, which is a
possible explanation of the differences between the multivariate and univariate procedures.
This does raise the question of direct numeric comparability of the measures and, further, the
interpretability of the multivariate measures in terms of a threshold value that would indicate a
clinical need for re-calibration of landmark selection.
The reliability of all six soft-tissue landmarks in this study was excellent (ICC >0.98). Lagravère
(2010) notes that landmarks with a mean difference of 1-2mm are clinically acceptable,
whereas landmarks with mean differences >2mm should be used with caution. All the
landmarks examined in this reliability study met this clinically acceptable criterion, with a
maximum mean Euclidean distance of 0.84mm (left alare). When Lee’s (1989) criteria were
applied, no landmark had the lower bound of the 95% confidence interval of the ICC below the
0.75 threshold. However, a significant difference existed between the two landmark selections
in the X dimension of pronasion (p=0.0014). Moreover, the difference was in the
negative direction, indicating a possible bias in which the second X coordinate is systematically
greater than the first. Using Lee’s criteria, the reliability of the landmark pronasion may not be
acceptable and may warrant review of the selection protocol and re-calibration. Left alare had the
largest variance of the Euclidean distance between selections of all the landmarks in the study.
The lowest mean within-coordinate ICC was found in the left alare landmark, reflecting two
of the lower ICCs in the Y and Z coordinates; left alare also had the highest mean Euclidean
distance between landmark selections. However, the lowest multivariate ICC was found in the
pronasion landmark, which was the only landmark with a significant difference between the
coordinates (X dimension). It is possible that the univariate and multivariate ICCs are sensitive
to different characteristics within the landmark data. Further work is needed to determine the
relationship and comparability of the within-coordinate measures of agreement and the unified
multivariate estimator of agreement as well as the interpretability and clinically acceptable
level of the multivariate intraclass correlation coefficient.
References
Adams, G. L., Gansky, S. A., Miller, A. J., Harrell, W. E., & Hatcher, D. C. (2004). Comparison between traditional 2-dimensional cephalometry and a 3-dimensional approach on human dry skulls. American Journal of Orthodontics and Dentofacial Orthopedics, 126(4), 397-409. doi:10.1016/j.ajodo.2004.03.023
Bland, J. M., & Altman, D. G. (1990). A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Computers in Biology and Medicine, 20(5), 337-340. doi:10.1016/0010-4825(90)90013-F
Chien, P., Parks, E., Eraso, F., Hartsfield, J., Roberts, W., & Ofner, S. (2009). Comparison of reliability in anatomical landmark identification using two-dimensional digital cephalometrics and three-dimensional cone beam computed tomography in vivo. Dentomaxillofacial Radiology, 38(5), 262-273. doi:10.1259/dmfr/81889955
Donner, A., & Bull, S. (1984). A comparison of significance-testing procedures for parent-child correlations computed from family data. Journal of the Royal Statistical Society. Series C (Applied Statistics), 33(3), 278-284.
Donner, A., & Eliasziw, M. (1991). Methodology for inferences concerning familial correlations: A review. Journal of Clinical Epidemiology, 44(4-5), 449-455. doi:10.1016/0895-4356(91)90084-M
Fourie, Z., Damstra, J., Gerrits, P. O., & Ren, Y. (2010). Evaluation of anthropometric accuracy and reliability using different three-dimensional scanning systems. Forensic Science International, In Press, Corrected Proof. doi:10.1016/j.forsciint.2010.09.018
Haggard, E. A. (1958). Intraclass correlation and the analysis of variance. New York: Dryden.
Konishi, S. (1982). Asymptotic properties of estimators of interclass correlation from familial data. Springer Netherlands. doi:10.1007/BF02481048
Konishi, S. (1985). Testing hypotheses about interclass correlations from familial data. Biometrics, 41(1), 167-176.
Lagravère, M. O., Gordon, J. M., Guedes, I. H., Flores-Mir, C., Carey, J. P., Heo, G., & Major, P. W. (2009). Reliability of traditional cephalometric landmarks as seen in three-dimensional analysis in maxillary expansion treatments. The Angle Orthodontist, 79(6), 1047-1056. doi:10.2319/010509-10R.1
Lagravère, M. O., Low, C., Flores-Mir, C., Chung, R., Carey, J. P., Heo, G., & Major, P. W. (2010). Intraexaminer and interexaminer reliabilities of landmark identification on digitized lateral cephalograms and formatted 3-dimensional cone-beam computerized tomography images. American Journal of Orthodontics and Dentofacial Orthopedics, 137(5), 598-604. doi:10.1016/j.ajodo.2008.07.018
Lee, J., Koh, D., & Ong, C. N. (1989). Statistical evaluation of agreement between two methods for measuring a quantitative variable. Computers in Biology and Medicine, 19(1), 61-70. doi:10.1016/0010-4825(89)90036-X
Lele, S., & Richtsmeier, J. (2001). An invariant approach to statistical analysis of shapes. Boca Raton: Chapman & Hall.
Mian, I. U. H., & Shoukri, M. M. (1997). Statistical analysis of intraclass correlations from multiple samples with applications to arterial blood pressure data. Statistics in Medicine, 16(13), 1497-1514. doi:10.1002/(SICI)1097-0258(19970715)16:13<1497::AID-SIM569>3.0.CO;2-7
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428. doi:10.1037/0033-2909.86.2.420
Figure 1. Facial Soft Tissue Landmarks
Table 1. Intraclass Correlation Coefficients for Each Landmark by Coordinate and Multivariate Estimate (Adjusted Measures)

| Landmark | Coordinate | ICC | 95% CI | P value | Mean ICC | Multivariate ICC |
|---|---|---|---|---|---|---|
| 1 Left Alare | X | 0.9997 | (0.9991, 0.9999) | <0.0001 | 0.9912 | 0.9999 |
| | Y | 0.9841 | (0.9546, 0.9945) | <0.0001 | | |
| | Z | 0.9899 | (0.9712, 0.9965) | <0.0001 | | |
| 2 Labiale Superius | X | 0.9960 | (0.9885, 0.9986) | <0.0001 | 0.9962 | 0.9994 |
| | Y | 0.9972 | (0.9918, 0.9990) | <0.0001 | | |
| | Z | 0.9954 | (0.9868, 0.9984) | <0.0001 | | |
| 3 Nasion | X | 0.9996 | (0.9987, 0.9998) | <0.0001 | 0.9997 | 0.9999 |
| | Y | 0.9999 | (0.9996, 0.9999) | <0.0001 | | |
| | Z | 0.9996 | (0.9987, 0.9998) | <0.0001 | | |
| 4 Pronasion | X | 0.9978 | (0.9936, 0.9992) | <0.0001 | 0.9953 | 0.9991 |
| | Y | 0.9976 | (0.9931, 0.9992) | <0.0001 | | |
| | Z | 0.9905 | (0.9923, 0.9991) | <0.0001 | | |
| 5 Right Alare | X | 0.9999 | (0.9997, 0.9999) | <0.0001 | 0.9980 | 0.9999 |
| | Y | 0.9968 | (0.9909, 0.9989) | <0.0001 | | |
| | Z | 0.9972 | (0.9919, 0.9990) | <0.0001 | | |
| 6 Labiale Inferius | X | 0.9986 | (0.9960, 0.9995) | <0.0001 | 0.9990 | 0.9996 |
| | Y | 0.9996 | (0.9988, 0.9999) | <0.0001 | | |
| | Z | 0.9988 | (0.9965, 0.9996) | <0.0001 | | |

All p values <0.0001; these are significance probabilities associated with the test of the null hypothesis that the intraclass correlation is equal to zero.
Figure 1 landmark labels: Subnasale, Nasion, Pronasion, Labiale Superius, Labiale Inferius, Right Alare, Left Alare.
Table 2. Summary of Within-coordinate Differences (mm) Between Dual Landmark Selections

| Landmark | Coordinate | Mean | St. Dev. | Min | Max | Median | P-value (t-test or Wilcoxon*) |
|---|---|---|---|---|---|---|---|
| 1 Left Alare | X | 0.04 | 0.27 | -0.70 | 0.48 | 0.07 | 0.5838 |
| | Y | 0.04 | 0.85 | -1.22 | 2.47 | -0.02 | 0.7615* |
| | Z | -0.01 | 0.70 | -1.34 | 1.43 | -0.17 | 0.9409 |
| 2 Labiale Superius | X | -0.19 | 0.47 | -0.98 | 0.64 | -0.18 | 0.3806 |
| | Y | -0.11 | 0.75 | -2.01 | 0.88 | -0.09 | 0.2826 |
| | Z | 0.03 | 0.28 | -0.66 | 0.55 | 0.06 | 0.3608 |
| 3 Nasion | X | -0.10 | 0.35 | -0.73 | 0.55 | -0.05 | 0.2713 |
| | Y | -0.04 | 0.50 | -0.94 | 0.79 | 0.03 | 0.7843 |
| | Z | 0.01 | 0.10 | -0.25 | 0.15 | 0.02 | 0.6548 |
| 4 Pronasion | X | -0.22 | 0.22 | -0.51 | 0.14 | -0.29 | 0.0014 |
| | Y | 0.13 | 0.39 | -0.61 | 0.69 | 0.18 | 0.2029 |
| | Z | 0.04 | 0.19 | -0.32 | 0.52 | 0.02 | 0.3911 |
| 5 Right Alare | X | 0.01 | 0.16 | -0.27 | 0.30 | 0.00 | 0.8771 |
| | Y | 0.16 | 0.39 | -0.40 | 0.97 | 0.06 | 0.1279 |
| | Z | 0.09 | 0.34 | -0.46 | 0.56 | -0.03 | 0.3415 |
| 6 Labiale Inferius | X | -0.09 | 0.40 | -0.77 | 0.63 | -0.12 | 0.9242 |
| | Y | 0.14 | 0.45 | -0.72 | 0.96 | 0.19 | 0.4622 |
| | Z | -0.01 | 0.18 | -0.38 | 0.22 | 0.03 | 0.1946 |
| 7 Subnasale | X | -0.17 | 0.50 | -1.12 | 0.59 | -0.08 | 0.2101 |
| | Y | 0.10 | 0.54 | -0.73 | 1.07 | 0.02 | 0.5081 |
| | Z | 0.30 | 0.81 | -1.11 | 1.68 | 0.31 | 0.1749 |

Significance probabilities associated with the t-test of the null hypothesis that the mean difference is equal to zero, or with the Wilcoxon* signed-rank test of the null hypothesis that the median difference is equal to zero (assuming symmetry).
Table 3. Summary of Euclidean Distances (mm) Between Dual Landmark Selections

| Landmark | N | Mean | St. Dev. | Median | Minimum | Maximum |
|---|---|---|---|---|---|---|
| Left Alare | 15 | 0.84 | 0.73 | 0.51 | 0.11 | 2.85 |
| Nasion | 15 | 0.53 | 0.31 | 0.52 | 0.10 | 1.19 |
| Pronasion | 15 | 0.49 | 0.24 | 0.45 | 0.08 | 0.91 |
| Right Alare | 15 | 0.48 | 0.29 | 0.52 | 0.13 | 1.12 |
| Labiale Inferius | 15 | 0.58 | 0.24 | 0.51 | 0.30 | 1.12 |
| Labiale Superius | 15 | 0.78 | 0.51 | 0.70 | 0.18 | 2.33 |