8/10/2019 Time Series Project - ARIMA
http://slidepdf.com/reader/full/time-series-project-arima 1/21
MATH 6303
Multivariate Statistical Analysis
Final Project
By
Marius M. Mihai
December 3, 2014
Final Project – Multivariate Statistical Analysis December 2014
1. Do a discriminant analysis and write a report for the data of Table 5.8.
Y1: Length of Cycle, Y2: Percentage of Rising Prices, Y3: Cyclical Amplitude, Y4: Rate of Change
The purpose of the discriminant analysis is to identify a linear combination of the variables
described above that would show the separation between consumer goods and producer goods.
Before finding the discriminant function, however, we need to check whether the univariate differences between the groups are significant. We therefore conduct four separate ANOVAs and analyze the results.
For variable y1 (length of cycle), the p-value is below 5%, so the difference is significant.
For variable y2 (% of rising prices), the difference between consumer and producer goods is not significant.
For variable y3 (cyclical amplitude), the difference between groups is significant.
For variable y4 (rate of change), the difference between groups is also not significant.
The univariate ANOVAs yielded significant differences for y1 and y3, while for y2 and y4 they did not identify any significant differences. The next step is to run a MANOVA to test for an overall difference between consumer goods and producer goods. The results were significant for all four test statistics: the p-value was below 5% in every case, so there is a significant difference between the two groups, consumer goods and producer goods.
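As a sketch of what the two-group MANOVA computes, the following Python snippet forms Wilks' lambda from the within- and between-group SSCP matrices. The data are randomly generated stand-ins (not the Table 5.8 values), and numpy replaces SAS here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the consumer/producer groups (9 and 10 observations)
g1 = rng.normal(0.0, 1.0, size=(9, 4))
g2 = rng.normal(1.0, 1.0, size=(10, 4))

def wilks_lambda(g1, g2):
    """Two-group Wilks' lambda = |E| / |E + H|."""
    m1, m2 = g1.mean(0), g2.mean(0)
    grand = np.vstack([g1, g2]).mean(0)
    E = (g1 - m1).T @ (g1 - m1) + (g2 - m2).T @ (g2 - m2)   # within-groups SSCP
    H = len(g1) * np.outer(m1 - grand, m1 - grand) + \
        len(g2) * np.outer(m2 - grand, m2 - grand)          # between-groups SSCP
    return np.linalg.det(E) / np.linalg.det(E + H)

lam = wilks_lambda(g1, g2)
```

Values of lambda near 0 indicate strong group separation, while values near 1 indicate none.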
Because the difference between consumer and producer goods was significant, a discriminant analysis will help identify which variables contribute most to the separation between groups. The analysis proceeds as follows:
a) First, the discriminant function and its coefficients are identified and tested for significance.
b) Second, the coefficients are standardized to eliminate any unit issues, so that the contribution of each variable can be analyzed.
c) Third, a stepwise selection of variables is applied to identify any redundancies.
The analysis was carried out assuming that the covariance matrices are equal. The discriminant function coefficients can be computed as a = S_pooled^(-1) (ȳ1 − ȳ2), which gives a' = (-0.05689, -0.00971, -0.24213, -0.0713). To test the significance of this discriminant function, Hotelling's T² was computed; in the two-group case, the discriminant function is significant exactly when T² is significant. Using the identity T² = (n1 + n2 − 2)(1 − Λ)/Λ, where Λ is Wilks' lambda from the MANOVA output (Λ = 0.48), we get T² = (9 + 10 − 2)(1 − 0.48)/0.48 = 18.42. The test statistic exceeds the table value T²(α = 0.05, p = 4, n1 + n2 − 2 = 17) = 15.117, so T² is significant, and so is the discriminant function. The linear discriminant function can thus be written as z = −0.06 y1 − 0.0097 y2 − 0.24 y3 − 0.071 y4.
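The computation above can be sketched as follows; this is a minimal Python illustration on randomly generated stand-in data (not the Table 5.8 values), using numpy in place of SAS.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical two-group data with the same group sizes as the report (9 and 10)
g1 = rng.normal(0.0, 1.0, size=(9, 4))
g2 = rng.normal(1.0, 1.0, size=(10, 4))
n1, n2 = len(g1), len(g2)

# Pooled covariance: S_pl = ((n1-1) S1 + (n2-1) S2) / (n1 + n2 - 2)
S1, S2 = np.cov(g1, rowvar=False), np.cov(g2, rowvar=False)
S_pl = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Discriminant coefficients a = S_pl^{-1} (ybar1 - ybar2)
d = g1.mean(0) - g2.mean(0)
a = np.linalg.solve(S_pl, d)

# Hotelling's T^2 = (n1 n2 / (n1 + n2)) d' S_pl^{-1} d
T2 = (n1 * n2 / (n1 + n2)) * d @ a
```

The identity T² = (n1 + n2 − 2)(1 − Λ)/Λ used in the text can be verified numerically against this T².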
The standardized coefficients of the discriminant function can be computed as a*_i = sqrt(s_ii) · a_i, where s_ii are the diagonal elements of S_pooled. The standardized coefficients, computed in SAS, are a*' = (-1.390, -0.083, -1.025, -0.032). Taking absolute values, these coefficients indicate each variable's contribution to the model; the ranking from most to least important is y1 (length of cycle), y3 (cyclical amplitude), y2 (percentage of rising prices), and y4 (rate of change). These results are comparable to those from the individual ANOVAs, in which the strongest differences were in y1 and y3 (according to their p-values), while y2 and y4 did not exhibit significant differences.
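The standardization step is simple enough to sketch directly; the covariance diagonal and raw coefficients below are made-up numbers chosen so the arithmetic is easy to check.

```python
import numpy as np

# Hypothetical pooled covariance matrix and raw coefficients (stand-in values)
S_pl = np.array([[4.0, 1.0],
                 [1.0, 9.0]])
a = np.array([0.5, -0.2])

# Standardize: a*_i = sqrt(s_ii) * a_i, removing the effect of measurement units
a_std = np.sqrt(np.diag(S_pl)) * a          # sqrt(4)*0.5 = 1.0, sqrt(9)*(-0.2) = -0.6
ranking = np.argsort(-np.abs(a_std))        # variable indices, most to least important
```

Ranking by |a*_i| rather than |a_i| is what makes the comparison across variables fair.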
The last step of the discriminant analysis is the stepwise procedure, conducted to identify any redundancies in the data. The output from the stepwise procedure in SAS can be found below. As expected, variable y1 (length of cycle) was entered first because it had the highest F-value, followed by y3 (cyclical amplitude) with the second highest F-value. After step 2, no remaining variables were significant. This was expected, because y2 (percentage of rising prices) and y4 (rate of change) were not found significant in the individual ANOVAs run in the initial step of the analysis; according to the standardized discriminant function, y2 and y4 also have the lowest contributions. Thus the reduced model with only y1 and y3 is as good as the full model, because y2 and y4 appear to be redundant in the full model.
2. Do a classification analysis and write a report on Table 5.6.
Y1: Intelligence, Y2: Form Relations, Y3: Dynamometer, Y4: Dotting, Y5: Sensory Motor
Coordination, Y6: Perseveration
Classification analysis is discriminant analysis taken one step further. Its purpose is to tell us where to place future subjects with various scores on intelligence, form relations, dynamometer, dotting, sensory motor coordination, and perseveration. In our case, we have a two-group classification analysis: engineer apprentices and pilots. Note that the preliminary analysis of these data (ANOVA, MANOVA, and tests for normality) was done previously in the midterm exam. Similarly to the previous problem, the analysis will be carried out as follows:
a) First, the discriminant function and its coefficients are identified and tested for significance.
b) Second, the coefficients are standardized to eliminate any unit issues, so that the contribution of each variable can be analyzed.
c) Third, the equality of the covariance matrices is tested.
d) Fourth, each observation is classified using both the linear and the quadratic discriminant functions, and the error rates are estimated.
e) Finally, the holdout method is used to see how its error rate compares with the previous two.
The discriminant function was computed in SAS. As in problem 1, the vector a' = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049) contains the coefficients of the linear discriminant function. T² = 66.7 was computed in the midterm exam and was significant; hence the linear discriminant function is also significant.
The next step was to compute the standardized coefficients, to identify the contribution of each variable to the overall model. The standardized coefficients were (0.174, 1.496, -1.391, -1.280, 1.131, -1.440). Taking absolute values, the ranking from most to least important is: y2 (form relations), y6 (perseveration), y3 (dynamometer), y4 (dotting), y5 (sensory motor coordination), and y1 (intelligence). These results agree with the midterm, where sensory motor coordination and intelligence appeared to be redundant in the full model; in the discriminant analysis, these two variables rank last in importance.
An assumption that was not tested in the midterm is particularly important for this analysis: the equality of the covariance matrices. This assumption was tested in a previous homework (see problem 7.22), and the covariance matrices appeared to be equal.
With the covariance matrices being equal, the first classification is based on the linear discriminant function with a' = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049). The next page contains the analysis, which was done in Microsoft Excel. There were two misclassifications among engineer apprentices and two among pilots, yielding an error rate of (2+2)/(20+20) = 0.1 (10%). This error rate will be compared with the other rates obtained later in the analysis to judge the rule's ability to predict group membership.
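The classification rule itself can be sketched in Python: with a = S_pooled^(-1)(ȳ1 − ȳ2), an observation y is assigned to group 1 when a'y exceeds the midpoint cutoff (a'ȳ1 + a'ȳ2)/2. The data below are random stand-ins, not the Table 5.6 scores.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-ins for the apprentice/pilot groups (20 each, 6 variables)
g1 = rng.normal(0.0, 1.0, size=(20, 6))
g2 = rng.normal(1.5, 1.0, size=(20, 6))
n1, n2 = len(g1), len(g2)

S_pl = ((n1 - 1) * np.cov(g1.T) + (n2 - 1) * np.cov(g2.T)) / (n1 + n2 - 2)
a = np.linalg.solve(S_pl, g1.mean(0) - g2.mean(0))
cut = 0.5 * (a @ g1.mean(0) + a @ g2.mean(0))     # midpoint cutoff

def classify(y):
    return 1 if a @ y > cut else 2                # group 1 lies above the cutoff

# Apparent error rate: misclassifications among the training observations
errors = sum(classify(y) != 1 for y in g1) + sum(classify(y) != 2 for y in g2)
apparent_error_rate = errors / (n1 + n2)
```

The apparent error rate is optimistic because the same observations are used to build and to evaluate the rule, which is why the holdout estimate later in the report is larger.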
[Excel classification table: for each of the 20 engineer apprentices and 20 pilots, the six scores y1-y6, the value of the linear discriminant function, and the resulting classification decision. Two apprentices were classified as pilots and two pilots were classified as engineer apprentices; all remaining observations were classified correctly.]
The next classification is based on a quadratic classification function. Although the sample covariance matrices did not differ significantly, we compare the error rates of the two models for completeness. The SAS results are below; the quadratic discriminant function yields a similar error rate of 10%.
A third approach is the holdout method; again, the SAS results are copied on the following page. There were 4 misclassifications among engineer apprentices and 2 among pilots, for a total error rate of 0.1750. As expected, the error rate increased compared to the previous two methods, giving a more realistic expectation of how the linear discriminant function will perform on future subjects.
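The holdout (leave-one-out) idea can be sketched as follows: each observation is classified by a rule fitted with that observation removed. The data are again random stand-ins for the two groups, not the Table 5.6 scores.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-ins for the two groups (20 each, 6 variables)
g1 = rng.normal(0.0, 1.0, size=(20, 6))
g2 = rng.normal(1.5, 1.0, size=(20, 6))

def linear_rule(g1, g2):
    """Fit the two-group linear classification rule and return a classifier."""
    n1, n2 = len(g1), len(g2)
    S_pl = ((n1 - 1) * np.cov(g1.T) + (n2 - 1) * np.cov(g2.T)) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pl, g1.mean(0) - g2.mean(0))
    cut = 0.5 * (a @ g1.mean(0) + a @ g2.mean(0))
    return lambda y: 1 if a @ y > cut else 2

# Leave-one-out: refit the rule with the held-out observation removed
errors = 0
for i in range(len(g1)):
    rule = linear_rule(np.delete(g1, i, axis=0), g2)
    errors += rule(g1[i]) != 1
for i in range(len(g2)):
    rule = linear_rule(g1, np.delete(g2, i, axis=0))
    errors += rule(g2[i]) != 2
holdout_error_rate = errors / (len(g1) + len(g2))
```

Because the held-out point never influences its own rule, this estimate is nearly unbiased for future performance, unlike the apparent error rate.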
3. Do a regression analysis and write a report on Table 3.4.
The first step in this multivariate regression analysis is to estimate the parameter matrix B̂ = (X'X)^(-1) X'Y. The parameter matrix was computed in SAS, and the output is below: the first table contains the parameters for y1 (relative weight) and the second the parameters for y2 (fasting plasma glucose).
The overall regression appears to be significant at α = 5%, as indicated by all four tests shown below. However, the R² values are relatively low: only 25% of the variability in y1 (relative weight) is explained by x1 (glucose intolerance), x2 (insulin response to oral glucose), and x3 (insulin resistance), and only 1.6% of the variability in y2 (fasting plasma glucose) is explained by x1, x2, and x3.
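The estimation step can be sketched in Python; X and Y below are synthetic stand-ins for the Table 3.4 variables, with an intercept column added to X.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
# Hypothetical predictors x1..x3 (with intercept) and two responses y1, y2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
B_true = np.array([[1.0, 2.0], [0.5, 0.0], [0.0, 0.3], [-0.2, 0.1]])
Y = X @ B_true + rng.normal(scale=0.5, size=(n, 2))

# Least squares: B_hat = (X'X)^{-1} X'Y, fitting both responses at once
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Per-response R^2 = 1 - SSE/SST
resid = Y - X @ B_hat
sse = (resid ** 2).sum(axis=0)
sst = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1 - sse / sst
```

Each column of B̂ is exactly the coefficient vector of the corresponding univariate regression, which is why SAS reports one parameter table per response.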
The relatively low R² values suggest that more explanatory variables may be needed to improve the model. Nevertheless, for the purpose of this analysis, we run a backward elimination to identify redundancies and find a subset of the x's.
To find the subset of the x's, we compute a conditional Wilks' lambda using formula 10.72 in the book. For example, Λ(x1 | x2, x3) = Λ(x1, x2, x3) / Λ(x2, x3); this is the value reported for X1 below, and the other x's are computed similarly. No elimination could be done at step 1, because even the largest Wilks' lambda (0.93) was significant at α = 5%, so the backward elimination process stops there.
Step 1: Λ(X1 | X2, X3) = 0.93 (glucose intolerance); Λ(X2 | X1, X3) = 0.89 (insulin response to oral glucose); Λ(X3 | X1, X2) = 0.76 (insulin resistance).
It appears that none of the independent variables can be eliminated; all three x's are needed in the full model. However, given the small R² values, they explain only a small portion of the variability in the y's, so, as mentioned earlier, the model needs additional explanatory variables to improve its accuracy.
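The conditional Wilks' lambda used above can be sketched as a ratio of two regression lambdas; the data are synthetic stand-ins for the Table 3.4 variables.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
# Hypothetical x1..x3 and two responses (stand-ins, not the Table 3.4 data)
X = rng.normal(size=(n, 3))
Y = X @ rng.normal(size=(3, 2)) + rng.normal(scale=0.5, size=(n, 2))

def reg_wilks(X, Y):
    """Wilks' lambda for the regression of Y on X (plus intercept): |E| / |T|."""
    n = len(Y)
    Xc = np.column_stack([np.ones(n), X])
    B = np.linalg.solve(Xc.T @ Xc, Xc.T @ Y)
    E = (Y - Xc @ B).T @ (Y - Xc @ B)       # residual SSCP
    Yc = Y - Y.mean(0)
    T = Yc.T @ Yc                           # total (centered) SSCP
    return np.linalg.det(E) / np.linalg.det(T)

# Conditional lambda for x1 given x2, x3 (formula 10.72 style)
lam_full = reg_wilks(X, Y)
lam_reduced = reg_wilks(X[:, 1:], Y)        # model without x1
lam_x1_given_rest = lam_full / lam_reduced
```

A conditional lambda near 1 means x1 adds little beyond x2 and x3, which is the quantity the backward elimination step thresholds.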
12.8. Carry out a principal component analysis on all six variables of the glucose data of Table
3.8. Use both S and R. Which do you think is more appropriate here? Show the percent of
variance explained. Based on the average eigenvalue or a scree plot, decide how many
components to retain. Can you interpret the components of either S or R?
The purpose of principal component analysis is to reduce the dimensionality of the data and simplify the model. For the data in Table 3.8, principal components were computed on both S and R.
First, consider the runs on the correlation matrix. The first four eigenvalues account for about 85% of the variance, which exceeds the 80% threshold, so we keep the first four components.
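The retention rule used here (keep components until the cumulative proportion of variance passes 80%) can be sketched as follows; the six-variable data set is a random stand-in for Table 3.8.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical 6-variable data set (a stand-in, not the Table 3.8 values)
data = rng.normal(size=(46, 6))
data[:, 1] = data[:, 0] + 0.3 * rng.normal(size=46)   # induce some correlation

R = np.corrcoef(data, rowvar=False)                   # 6 x 6 correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]                 # eigenvalues, largest first
prop = eigvals / eigvals.sum()                        # proportion of variance each
cum = np.cumsum(prop)
k = int(np.searchsorted(cum, 0.80) + 1)               # components needed for 80%
```

For a correlation matrix the eigenvalues sum to the number of variables, so the average-eigenvalue rule reduces to keeping components with eigenvalue above 1.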
In the case of the covariance matrix, the first three eigenvalues account for 89% of the variability, which is higher than 80%, so we can keep the first three principal components. This makes sense because the components of S are dominated by the larger variances of x1, x2, and x3.
In this case, because of the disparate variances in S, choosing the principal components from R
will be more appropriate.
To interpret the principal components in the case of R, we correlate the chosen components with the original variables. The runs were done in SAS and can be seen circled in the figure below. The correlations between the principal components and the variables differ, and only those above 0.5 were deemed meaningful. After selecting the first four principal components, strong correlations (over 0.5) can be identified between the first component and y1, y3, x1, x2, and x3; between the second component and y2; between the third component and x2; and between the fourth component and y1 and y3.
12.12 Carry out a principal component analysis on the engineer data of Table 5.6 as follows:
(a) Use the pooled covariance matrix.
(b) Ignore groups and use a covariance matrix based on all 40 observations.
(c) Which of the approaches in (a) or (b) appears to be more successful?
Here we run a principal component analysis using both a pooled covariance matrix and an unpooled covariance matrix computed over all 40 observations, ignoring groups. The two matrices were computed in SAS and are shown in the figures below: the first is the pooled covariance matrix and the second is the unpooled covariance matrix.
First we run the component analysis on the pooled covariance matrix; the following results were obtained in SAS. The first three components account for 87% of the variance, so it is enough to keep them. The first three components are a1' = (0.212, -0.039, 0.08, 0.775, -0.956, 0.580), a2' = (0.389, 0.064, -0.066, -0.608, 0.01, 0.686), and a3' = (0.889, 0.096, 0.08, 0.08, 0.01, -0.434).
For the unpooled matrix, I could not use a SAS procedure directly, so I computed the decomposition in IML; the output is copied on the next page. The table under the figures gives the cumulative proportion of variance explained by the eigenvalues. Similar to the analysis for the pooled covariance matrix, the first three eigenvalues account for about 85% of the total variance, so the first three eigenvectors (components) can be kept in the model.
Given that the two analyses are very similar, neither approach appears more successful; the results are essentially independent of whether the pooled or the unpooled covariance matrix is used.
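The two decompositions can be sketched side by side in Python; the two groups below are random stand-ins for the Table 5.6 data, not the real scores.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two hypothetical groups of 20 observations on 6 variables (stand-ins)
g1 = rng.normal(0.0, 1.0, size=(20, 6))
g2 = rng.normal(2.0, 1.0, size=(20, 6))
n1, n2 = len(g1), len(g2)

# (a) pooled within-group covariance matrix
S_pl = ((n1 - 1) * np.cov(g1.T) + (n2 - 1) * np.cov(g2.T)) / (n1 + n2 - 2)
# (b) covariance matrix ignoring groups, based on all 40 observations
S_all = np.cov(np.vstack([g1, g2]).T)

def cum_prop(S):
    """Cumulative proportion of variance explained by the eigenvalues of S."""
    vals = np.sort(np.linalg.eigvalsh(S))[::-1]   # largest first
    return np.cumsum(vals) / vals.sum()

cum_pl, cum_all = cum_prop(S_pl), cum_prop(S_all)
```

Note that ignoring groups inflates the variance along the mean-difference direction, so S_all can put more weight on a leading component that reflects group separation rather than within-group structure.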
Eigenvalue Proportion Cumulative Proportion
1,050.5963 38.6% 38.6%
858.3158 31.6% 70.2%
398.9035 14.7% 84.9%
259.1484 9.5% 94.4%
108.0892 4.0% 98.4%
43.3535 1.6% 100.0%