8/10/2019 Time Series Project - ARIMA
http://slidepdf.com/reader/full/time-series-project-arima 1/21
MATH 6303
Multivariate Statistical Analysis
Final Project
By
Marius M. Mihai
December 3, 2014
Final Project – Multivariate Statistical Analysis December 2014
1. Do a discriminant analysis and write a report for the data of Table 5.8.
Y1: Length of Cycle, Y2: Percentage of Rising Prices, Y3: Cyclical Amplitude, Y4: Rate of Change
The purpose of the discriminant analysis is to identify a linear combination of the variables
described above that would show the separation between consumer goods and producer goods.
Before finding the discriminant function, however, we need to check whether the univariate differences between the groups are significant. We therefore conduct four separate ANOVAs and analyze the results.
For variable y1 (length of cycle), the p-value is below 5%, so the difference is significant.
For variable y2 (% of rising prices), the difference between consumer and producer goods is not significant.
For variable y3 (cyclical amplitude), the difference between groups is significant.
For variable y4 (rate of change), the difference between groups is also not significant.
The univariate ANOVAs yielded significant differences for y1 and y3, while for y2 and y4 they did not identify any significant differences. The next step is to run a MANOVA to test for an overall difference between consumer goods and producer goods. The results were significant for all four test statistics: the p-value was below 5% in every case, so there is a significant difference between the two groups, consumer goods and producer goods.
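As a sketch of what the two-group MANOVA computes, the following Python snippet forms Wilks' lambda from the within- and between-group SSCP matrices. The data are randomly generated stand-ins (not the Table 5.8 values), and numpy replaces SAS here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the consumer/producer groups (9 and 10 observations)
g1 = rng.normal(0.0, 1.0, size=(9, 4))
g2 = rng.normal(1.0, 1.0, size=(10, 4))

def wilks_lambda(g1, g2):
    """Two-group Wilks' lambda = |E| / |E + H|."""
    m1, m2 = g1.mean(0), g2.mean(0)
    grand = np.vstack([g1, g2]).mean(0)
    E = (g1 - m1).T @ (g1 - m1) + (g2 - m2).T @ (g2 - m2)   # within-groups SSCP
    H = len(g1) * np.outer(m1 - grand, m1 - grand) + \
        len(g2) * np.outer(m2 - grand, m2 - grand)          # between-groups SSCP
    return np.linalg.det(E) / np.linalg.det(E + H)

lam = wilks_lambda(g1, g2)
```

Values of lambda near 0 indicate strong group separation, while values near 1 indicate none.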
Because the difference between consumer and producer goods was significant, a discriminant analysis will help identify which variables contribute most to the separation between groups. The analysis proceeds as follows:
a) First, the discriminant function and its coefficients are identified and tested for significance.
b) Second, the coefficients are standardized to eliminate any unit issues, so that the contribution of each variable can be analyzed.
c) Third, a stepwise selection of variables is applied to identify any redundancies.
The analysis was carried out assuming that the covariance matrices are equal. The discriminant function coefficients can be computed as a = S_pooled^(-1) (ȳ1 − ȳ2), which gives a' = (-0.05689, -0.00971, -0.24213, -0.0713). To test the significance of this discriminant function, Hotelling's T² was computed; in the two-group case, the discriminant function is significant exactly when T² is significant. Using the identity T² = (n1 + n2 − 2)(1 − Λ)/Λ, where Λ is Wilks' lambda from the MANOVA output (Λ = 0.48), we get T² = (9 + 10 − 2)(1 − 0.48)/0.48 = 18.42. The test statistic exceeds the table value T²(α = 0.05, p = 4, n1 + n2 − 2 = 17) = 15.117, so T² is significant, and so is the discriminant function. The linear discriminant function can thus be written as z = −0.06 y1 − 0.0097 y2 − 0.24 y3 − 0.071 y4.
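The computation above can be sketched as follows; this is a minimal Python illustration on randomly generated stand-in data (not the Table 5.8 values), using numpy in place of SAS.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical two-group data with the same group sizes as the report (9 and 10)
g1 = rng.normal(0.0, 1.0, size=(9, 4))
g2 = rng.normal(1.0, 1.0, size=(10, 4))
n1, n2 = len(g1), len(g2)

# Pooled covariance: S_pl = ((n1-1) S1 + (n2-1) S2) / (n1 + n2 - 2)
S1, S2 = np.cov(g1, rowvar=False), np.cov(g2, rowvar=False)
S_pl = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Discriminant coefficients a = S_pl^{-1} (ybar1 - ybar2)
d = g1.mean(0) - g2.mean(0)
a = np.linalg.solve(S_pl, d)

# Hotelling's T^2 = (n1 n2 / (n1 + n2)) d' S_pl^{-1} d
T2 = (n1 * n2 / (n1 + n2)) * d @ a
```

The identity T² = (n1 + n2 − 2)(1 − Λ)/Λ used in the text can be verified numerically against this T².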
The standardized coefficients of the discriminant function can be computed as a*_i = sqrt(s_ii) · a_i, where s_ii are the diagonal elements of S_pooled. The standardized coefficients, computed in SAS, are a*' = (-1.390, -0.083, -1.025, -0.032). Taking absolute values, these coefficients indicate each variable's contribution to the model; the ranking from most to least important is y1 (length of cycle), y3 (cyclical amplitude), y2 (percentage of rising prices), and y4 (rate of change). These results are comparable to those from the individual ANOVAs, in which the strongest differences were in y1 and y3 (according to their p-values), while y2 and y4 did not exhibit significant differences.
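The standardization step is simple enough to sketch directly; the covariance diagonal and raw coefficients below are made-up numbers chosen so the arithmetic is easy to check.

```python
import numpy as np

# Hypothetical pooled covariance matrix and raw coefficients (stand-in values)
S_pl = np.array([[4.0, 1.0],
                 [1.0, 9.0]])
a = np.array([0.5, -0.2])

# Standardize: a*_i = sqrt(s_ii) * a_i, removing the effect of measurement units
a_std = np.sqrt(np.diag(S_pl)) * a          # sqrt(4)*0.5 = 1.0, sqrt(9)*(-0.2) = -0.6
ranking = np.argsort(-np.abs(a_std))        # variable indices, most to least important
```

Ranking by |a*_i| rather than |a_i| is what makes the comparison across variables fair.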
The last step of the discriminant analysis is the stepwise procedure, conducted to identify any redundancies in the data. The output from the stepwise procedure in SAS can be found below. As expected, variable y1 (length of cycle) was entered first because it had the highest F-value, followed by y3 (cyclical amplitude) with the second highest F-value. After step 2, no remaining variables were significant. This was expected, because y2 (percentage of rising prices) and y4 (rate of change) were not found significant in the individual ANOVAs run in the initial step of the analysis; according to the standardized discriminant function, y2 and y4 also have the lowest contributions. Thus the reduced model with only y1 and y3 is as good as the full model, because y2 and y4 appear to be redundant in the full model.
2. Do a classification analysis and write a report on Table 5.6.
Y1: Intelligence, Y2: Form Relations, Y3: Dynamometer, Y4: Dotting, Y5: Sensory Motor
Coordination, Y6: Perseveration
Classification analysis is discriminant analysis taken one step further. Its purpose is to tell us where to place future subjects with various scores on intelligence, form relations, dynamometer, dotting, sensory motor coordination, and perseveration. In our case, we have a two-group classification analysis: engineer apprentices and pilots. Note that the preliminary analysis of these data (ANOVA, MANOVA, and tests for normality) was done previously in the midterm exam. Similarly to the previous problem, the analysis will be carried out as follows:
a) First, the discriminant function and its coefficients are identified and tested for significance.
b) Second, the coefficients are standardized to eliminate any unit issues, so that the contribution of each variable can be analyzed.
c) Third, the equality of the covariance matrices is tested.
d) Fourth, each observation is classified using both the linear and the quadratic discriminant functions, and the error rates are estimated.
e) Finally, the holdout method is used to see how its error rate compares with the previous two.
The discriminant function was computed in SAS. As in problem 1, the vector a' = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049) contains the coefficients of the linear discriminant function. T² = 66.7 was computed in the midterm exam and was significant; hence the linear discriminant function is also significant.
The next step was to compute the standardized coefficients, to identify the contribution of each variable to the overall model. The standardized coefficients were (0.174, 1.496, -1.391, -1.280, 1.131, -1.440). Taking absolute values, the ranking from most to least important is: y2 (form relations), y6 (perseveration), y3 (dynamometer), y4 (dotting), y5 (sensory motor coordination), and y1 (intelligence). These results agree with the midterm, where sensory motor coordination and intelligence appeared to be redundant in the full model; in the discriminant analysis, these two variables rank last in importance.
An assumption that was not tested in the midterm is particularly important for this analysis: the equality of the covariance matrices. This assumption was tested in a previous homework (see problem 7.22), and the covariance matrices appeared to be equal.
With the covariance matrices being equal, the first classification is based on the linear discriminant function with a' = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049). The next page contains the analysis, which was done in Microsoft Excel. There were two misclassifications among engineer apprentices and two among pilots, yielding an error rate of (2+2)/(20+20) = 0.1 (10%). This error rate will be compared with the other rates obtained later in the analysis to judge the rule's ability to predict group membership.
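The classification rule itself can be sketched in Python: with a = S_pooled^(-1)(ȳ1 − ȳ2), an observation y is assigned to group 1 when a'y exceeds the midpoint cutoff (a'ȳ1 + a'ȳ2)/2. The data below are random stand-ins, not the Table 5.6 scores.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-ins for the apprentice/pilot groups (20 each, 6 variables)
g1 = rng.normal(0.0, 1.0, size=(20, 6))
g2 = rng.normal(1.5, 1.0, size=(20, 6))
n1, n2 = len(g1), len(g2)

S_pl = ((n1 - 1) * np.cov(g1.T) + (n2 - 1) * np.cov(g2.T)) / (n1 + n2 - 2)
a = np.linalg.solve(S_pl, g1.mean(0) - g2.mean(0))
cut = 0.5 * (a @ g1.mean(0) + a @ g2.mean(0))     # midpoint cutoff

def classify(y):
    return 1 if a @ y > cut else 2                # group 1 lies above the cutoff

# Apparent error rate: misclassifications among the training observations
errors = sum(classify(y) != 1 for y in g1) + sum(classify(y) != 2 for y in g2)
apparent_error_rate = errors / (n1 + n2)
```

The apparent error rate is optimistic because the same observations are used to build and to evaluate the rule, which is why the holdout estimate later in the report is larger.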
[Excel classification table: for each of the 20 engineer apprentices and 20 pilots, the six scores y1-y6, the value of the linear discriminant function, and the resulting classification decision. Two apprentices were classified as pilots and two pilots were classified as engineer apprentices; all remaining observations were classified correctly.]
The next classification is based on a quadratic classification function. Although the sample covariance matrices did not differ significantly, we compare the error rates of the two models for completeness. The SAS results are below; the quadratic discriminant function yields a similar error rate of 10%.
A third approach is the holdout method; again, the SAS results are copied on the following page. There were 4 misclassifications among engineer apprentices and 2 among pilots, for a total error rate of 0.1750. As expected, the error rate increased compared to the previous two methods, giving a more realistic expectation of how the linear discriminant function will perform on future subjects.
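The holdout (leave-one-out) idea can be sketched as follows: each observation is classified by a rule fitted with that observation removed. The data are again random stand-ins for the two groups, not the Table 5.6 scores.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-ins for the two groups (20 each, 6 variables)
g1 = rng.normal(0.0, 1.0, size=(20, 6))
g2 = rng.normal(1.5, 1.0, size=(20, 6))

def linear_rule(g1, g2):
    """Fit the two-group linear classification rule and return a classifier."""
    n1, n2 = len(g1), len(g2)
    S_pl = ((n1 - 1) * np.cov(g1.T) + (n2 - 1) * np.cov(g2.T)) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pl, g1.mean(0) - g2.mean(0))
    cut = 0.5 * (a @ g1.mean(0) + a @ g2.mean(0))
    return lambda y: 1 if a @ y > cut else 2

# Leave-one-out: refit the rule with the held-out observation removed
errors = 0
for i in range(len(g1)):
    rule = linear_rule(np.delete(g1, i, axis=0), g2)
    errors += rule(g1[i]) != 1
for i in range(len(g2)):
    rule = linear_rule(g1, np.delete(g2, i, axis=0))
    errors += rule(g2[i]) != 2
holdout_error_rate = errors / (len(g1) + len(g2))
```

Because the held-out point never influences its own rule, this estimate is nearly unbiased for future performance, unlike the apparent error rate.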
3. Do a regression analysis and write a report on Table 3.4.
The first step in this multivariate regression analysis is to estimate the parameter matrix B̂ = (X'X)^(-1) X'Y. The parameter matrix was computed in SAS, and the output is below: the first table contains the parameters for y1 (relative weight) and the second the parameters for y2 (fasting plasma glucose).
The overall regression appears to be significant at α = 5%, as indicated by all four tests shown below. However, the R² values are relatively low: only 25% of the variability in y1 (relative weight) is explained by x1 (glucose intolerance), x2 (insulin response to oral glucose), and x3 (insulin resistance), and only 1.6% of the variability in y2 (fasting plasma glucose) is explained by x1, x2, and x3.
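The estimation step can be sketched in Python; X and Y below are synthetic stand-ins for the Table 3.4 variables, with an intercept column added to X.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
# Hypothetical predictors x1..x3 (with intercept) and two responses y1, y2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
B_true = np.array([[1.0, 2.0], [0.5, 0.0], [0.0, 0.3], [-0.2, 0.1]])
Y = X @ B_true + rng.normal(scale=0.5, size=(n, 2))

# Least squares: B_hat = (X'X)^{-1} X'Y, fitting both responses at once
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Per-response R^2 = 1 - SSE/SST
resid = Y - X @ B_hat
sse = (resid ** 2).sum(axis=0)
sst = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1 - sse / sst
```

Each column of B̂ is exactly the coefficient vector of the corresponding univariate regression, which is why SAS reports one parameter table per response.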
The relatively low R² values suggest that more explanatory variables may be needed to improve the model. Nevertheless, for the purpose of this analysis, we run a backward elimination to identify redundancies and find a subset of the x's.
To find the subset of the x's, we compute a conditional Wilks' lambda using formula 10.72 in the book. For example, Λ(x1 | x2, x3) = Λ(x1, x2, x3) / Λ(x2, x3); this is the value reported for X1 below, and the other x's are computed similarly. No elimination could be done at step 1, because even the largest Wilks' lambda (0.93) was significant at α = 5%, so the backward elimination process stops there.
Step 1: Λ(X1 | X2, X3) = 0.93 (glucose intolerance); Λ(X2 | X1, X3) = 0.89 (insulin response to oral glucose); Λ(X3 | X1, X2) = 0.76 (insulin resistance).
It appears that none of the independent variables can be eliminated; all three x's are needed in the full model. However, given the small R² values, they explain only a small portion of the variability in the y's, so, as mentioned earlier, the model needs additional explanatory variables to improve its accuracy.
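The conditional Wilks' lambda used above can be sketched as a ratio of two regression lambdas; the data are synthetic stand-ins for the Table 3.4 variables.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
# Hypothetical x1..x3 and two responses (stand-ins, not the Table 3.4 data)
X = rng.normal(size=(n, 3))
Y = X @ rng.normal(size=(3, 2)) + rng.normal(scale=0.5, size=(n, 2))

def reg_wilks(X, Y):
    """Wilks' lambda for the regression of Y on X (plus intercept): |E| / |T|."""
    n = len(Y)
    Xc = np.column_stack([np.ones(n), X])
    B = np.linalg.solve(Xc.T @ Xc, Xc.T @ Y)
    E = (Y - Xc @ B).T @ (Y - Xc @ B)       # residual SSCP
    Yc = Y - Y.mean(0)
    T = Yc.T @ Yc                           # total (centered) SSCP
    return np.linalg.det(E) / np.linalg.det(T)

# Conditional lambda for x1 given x2, x3 (formula 10.72 style)
lam_full = reg_wilks(X, Y)
lam_reduced = reg_wilks(X[:, 1:], Y)        # model without x1
lam_x1_given_rest = lam_full / lam_reduced
```

A conditional lambda near 1 means x1 adds little beyond x2 and x3, which is the quantity the backward elimination step thresholds.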
12.8. Carry out a principal component analysis on all six variables of the glucose data of Table
3.8. Use both S and R. Which do you think is more appropriate here? Show the percent of
variance explained. Based on the average eigenvalue or a scree plot, decide how many
components to retain. Can you interpret the components of either S or R?
The purpose of principal component analysis is to reduce the dimensionality of the data and simplify the model. For the data in Table 3.8, principal components were computed on both S and R.
First, consider the runs on the correlation matrix. The first four eigenvalues account for about 85% of the variance, which exceeds the 80% threshold, so we keep the first four components.
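The retention rule used here (keep components until the cumulative proportion of variance passes 80%) can be sketched as follows; the six-variable data set is a random stand-in for Table 3.8.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical 6-variable data set (a stand-in, not the Table 3.8 values)
data = rng.normal(size=(46, 6))
data[:, 1] = data[:, 0] + 0.3 * rng.normal(size=46)   # induce some correlation

R = np.corrcoef(data, rowvar=False)                   # 6 x 6 correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]                 # eigenvalues, largest first
prop = eigvals / eigvals.sum()                        # proportion of variance each
cum = np.cumsum(prop)
k = int(np.searchsorted(cum, 0.80) + 1)               # components needed for 80%
```

For a correlation matrix the eigenvalues sum to the number of variables, so the average-eigenvalue rule reduces to keeping components with eigenvalue above 1.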
In the case of the covariance matrix, the first three eigenvalues account for 89% of the variability, which is higher than 80%, so we can keep the first three principal components. This makes sense because the components of S are dominated by the larger variances of x1, x2, and x3.
In this case, because of the disparate variances in S, choosing the principal components from R
will be more appropriate.
To interpret the principal components in the case of R, we correlate the chosen components with the original variables. The runs were done in SAS and can be seen circled in the figure below. The correlations between the principal components and the variables differ, and only those above 0.5 were deemed meaningful. After selecting the first four principal components, strong correlations (over 0.5) can be identified between the first component and y1, y3, x1, x2, and x3; between the second component and y2; between the third component and x2; and between the fourth component and y1 and y3.
12.12 Carry out a principal component analysis on the engineer data of Table 5.6 as follows:
(a) Use the pooled covariance matrix.
(b) Ignore groups and use a covariance matrix based on all 40 observations.
(c) Which of the approaches in (a) or (b) appears to be more successful?
Here we run a principal component analysis using both a pooled covariance matrix and an unpooled covariance matrix computed over all 40 observations, ignoring groups. The two matrices were computed in SAS and are shown in the figures below: the first is the pooled covariance matrix and the second is the unpooled covariance matrix.
First we run the component analysis on the pooled covariance matrix; the following results were obtained in SAS. The first three components account for 87% of the variance, so it is enough to keep them. The first three components are a1' = (0.212, -0.039, 0.08, 0.775, -0.956, 0.580), a2' = (0.389, 0.064, -0.066, -0.608, 0.01, 0.686), and a3' = (0.889, 0.096, 0.08, 0.08, 0.01, -0.434).
For the unpooled matrix, I could not use a SAS procedure directly, so I computed the decomposition in IML; the output is copied on the next page. The table under the figures gives the cumulative proportion of variance explained by the eigenvalues. Similar to the analysis for the pooled covariance matrix, the first three eigenvalues account for about 85% of the total variance, so the first three eigenvectors (components) can be kept in the model.
Given that the two analyses are very similar, neither approach appears more successful; the results are essentially independent of whether the pooled or the unpooled covariance matrix is used.
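The two decompositions can be sketched side by side in Python; the two groups below are random stand-ins for the Table 5.6 data, not the real scores.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two hypothetical groups of 20 observations on 6 variables (stand-ins)
g1 = rng.normal(0.0, 1.0, size=(20, 6))
g2 = rng.normal(2.0, 1.0, size=(20, 6))
n1, n2 = len(g1), len(g2)

# (a) pooled within-group covariance matrix
S_pl = ((n1 - 1) * np.cov(g1.T) + (n2 - 1) * np.cov(g2.T)) / (n1 + n2 - 2)
# (b) covariance matrix ignoring groups, based on all 40 observations
S_all = np.cov(np.vstack([g1, g2]).T)

def cum_prop(S):
    """Cumulative proportion of variance explained by the eigenvalues of S."""
    vals = np.sort(np.linalg.eigvalsh(S))[::-1]   # largest first
    return np.cumsum(vals) / vals.sum()

cum_pl, cum_all = cum_prop(S_pl), cum_prop(S_all)
```

Note that ignoring groups inflates the variance along the mean-difference direction, so S_all can put more weight on a leading component that reflects group separation rather than within-group structure.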
Eigenvalue Proportion Cumulative Proportion
1,050.5963 38.6% 38.6%
858.3158 31.6% 70.2%
398.9035 14.7% 84.9%
259.1484 9.5% 94.4%
108.0892 4.0% 98.4%
43.3535 1.6% 100.0%