CY YOUNG PREDICTOR

An Investigation in Value Within Baseball | Christian Karayannides

ECONOMETRICS

5/9/2009

A paper I wrote for Econometrics in 2009.


Table of Contents

Introduction
Literature Review
    Equation 1: Pitching Runs
    Equation 2: Bill James Cy Young Predictor
Theory
    Equation 3: Estimated Cy Young Predictor
Data Overview
    Figure 1 Cy Young Award Points
    Figure 2: Average Cy Young Points
    Figure 3: Cy Young Points by Strikeouts
    Figure 4: Cy Young Points by Wins
    Correlations and Descriptions
    Figure 5: Plotted Variable Correlations
    Figure 6: Correlation Table
Equations and Tests
    Equation 4
    Equation 5
    Omitted and Irrelevant Variables
    Equation 6
    Equation 7
    Multicollinearity
    Serial Correlation
    Figure 7 Durbin Watson d Statistic Calculation
    Figure 8: Durbin Watson d Statistic Interpretation Tool
    Heteroskedasticity
    Figure 9 Plot of Residuals
    Figure 10: Park Test Results
    Alternative Testing
    Figure 11: Observed and Estimated Cy Young Results from 2001-2004
Errors
Conclusion and Results
    Figure 12: Prediction of 2009 Cy Young Winner Based on Season Statistics as of May 7th 2009
    Equation 8: Bill James Cy Young Predictor
    Equation 9: Estimated Cy Young Predictor
Bibliography


Introduction

Baseball is often referred to as a statistical sport. Compared to other sports, statistical analysis is much more appropriate in baseball because of the structure of the game. Unlike other team sports, baseball involves individual competition and repetitive situations. It is therefore easy to assess the individual value or contribution of each player. In other team sports, this is not as simple because of the effect which teammates have on each individual’s success.

There are many other factors which contribute to the statistical nature of baseball. The level of failure in baseball is unparalleled in other team sports. If a player fails to get a base hit in fewer than two thirds of his at-bats over the course of his career, he will be a consistent all-star. All baseball players are extremely inconsistent, and on any given day their level of success can vary dramatically. The value of a player, therefore, cannot be properly estimated by examining his performance in one game. The inconsistency inherent in the game necessitates the use of statistics in order to properly illustrate a player’s value.

The inconsistent nature of baseball is combated by playing 162 games a year. The long season reduces the inconsistency of teams and improves the chance that their records will reflect their respective levels of talent. Similarly, by recording a player’s success over the course of the season, a less erratic evaluation of his talent can be achieved.

There are many parallels between economics and baseball. These parallels make baseball an interesting alternative subject for economics students to study. Baseball statistics are intended to identify the most efficient contributors to a team’s success. The efficient use of at-bats to create as many base hits as possible can easily be compared to the efficient use of costs to create as much revenue as possible. The relationship between at-bats and hits, like the relationship between costs and revenue, is an intrinsic element of its field.

Statistics are an essential element of modern baseball. Baseball statistics provide jobs, attract fans, help determine ticket prices, govern managerial and player development strategies, and establish the salaries of players.1

The goal of this report is to create a regression equation which effectively predicts the Cy Young Award winner. The equation can be used to estimate who will win the award by examining the statistics at the end of the year, or at any point throughout the season. The effectiveness of the equation can be tested by inputting data from past years and examining the results to see if the equation produces the actual winner of the award.

Literature Review

In the 1990s, a wave of new statistical analysis called sabermetrics challenged the conventional methods of statistical assessment in baseball. Sabermetrics were invented and popularized by acclaimed baseball statistician, Bill James.2 “The term (sabermetrics), which James coined two decades ago, echoes the acronym Society for American Baseball Research, and denotes the search for objective knowledge about baseball.”3 James pioneered the reassessment of baseball statistics such as batting average and introduced new statistics such as OBP (on-base percentage), SLG (slugging percentage), and WHIP ([Walks + Hits]/Innings Pitched).

1 Jim Albert pg. 1

In order to produce an adequate equation that predicts the Cy Young Award winner, the contribution of sabermetrics must be understood and incorporated. In Jim Albert’s essay, “An Introduction to Sabermetrics”, he reveals the different historical contributors to sabermetric development.4 He also explains a number of important sabermetric calculations, by providing the equation and describing its function and utility. Albert provides a useful sabermetric measure of pitching ability called pitching runs.5

Pitching Runs = (League ERA / 9) x (Innings Pitched) – ER

Equation 1: Pitching Runs

The league earned run average (ERA) divided by nine represents the average number of earned runs allowed per inning by an average pitcher in that year.6 Multiplying this figure by the individual pitcher’s innings and then subtracting his earned runs (ER) allowed yields his pitching runs. A result of zero indicates a pitcher who was average in comparison with his peers.7 A result higher than zero indicates an above-average pitcher. This statistic evaluates pitchers much more effectively than traditional values such as wins or ERA. Albert explains that the flaw of using wins to evaluate pitchers is that the offense supporting the pitcher is too large a factor.8 Wins, therefore, are more successful at illustrating a team’s ability than that of the pitcher who earns the win. Pitching runs improves on ERA in that it factors in the league ERA and reflects the pitcher’s durability in terms of innings.
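Equation 1 can be sketched as a small function. The pitcher and league figures in the example are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the data set used in this paper.

```python
def pitching_runs(league_era: float, innings_pitched: float, earned_runs: float) -> float:
    """Runs saved relative to a league-average pitcher over the same innings (Equation 1)."""
    # League ERA / 9 is the earned runs an average pitcher allows per inning.
    expected_runs = (league_era / 9) * innings_pitched
    return expected_runs - earned_runs


# Hypothetical pitcher: 200 innings, 70 earned runs, in a league with a 4.50 ERA.
print(pitching_runs(league_era=4.50, innings_pitched=200, earned_runs=70))  # 30.0
```

An average pitcher would have allowed 100 earned runs over those 200 innings, so the hypothetical pitcher is credited with 30 pitching runs.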

In his book, The Neyer/James Guide to Pitchers, Bill James proposes a Cy Young predictor which he calls, “E = M CY Squared”.9 His Cy Young formula is as follows.

Cy Young Points = 6W – 2L + K/12 + 2.5S + H + R + 12F

Equation 2: Bill James Cy Young Predictor

W = Wins
L = Losses
K = Strikeouts
S = Saves
H = Shutouts
R = Runs Saved
F = dummy variable which is 1 if the pitcher pitched for a first-place team and 0 otherwise

2 Ben McGrath pg. 1
3 Ben McGrath pg. 1
4 Jim Albert pg. 1
5 Jim Albert pg. 1
6 Jim Albert pg. 1
7 Jim Albert pg. 1
8 Jim Albert pg. 1
9 Bill James/Rob Neyer pg. 467

This formula is very effective when tested with historical data. The equation identified the Cy Young Award winner with 81% accuracy.10 It also provides rankings for the top five picks. This formula can be applied to final season statistics in future years, or to the statistics accumulated up to any point within a season, to determine the frontrunners for the award. Unfortunately, James believes that this formula is more accurate at predicting past award winners than it will be at predicting future winners.11 This is due to the popularity of new statistics and their effect on the decisions of the Cy Young voters.12
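As a minimal sketch, Equation 2 can be written as a function and used to rank candidates. The function and parameter names are illustrative, the two pitchers are hypothetical, and Runs Saved is treated as an input because its calculation is not spelled out here.

```python
def james_cy_young_points(wins, losses, strikeouts, saves, shutouts,
                          runs_saved, first_place):
    """Cy Young points as written in Equation 2: 6W - 2L + K/12 + 2.5S + H + R + 12F."""
    bonus = 12 if first_place else 0  # F is a 0/1 dummy for a first-place team
    return (6 * wins - 2 * losses + strikeouts / 12
            + 2.5 * saves + shutouts + runs_saved + bonus)


# Hypothetical candidates (not real season lines): a starter and a closer.
candidates = {
    "Starter A": james_cy_young_points(20, 5, 240, 0, 2, 30.0, True),
    "Closer B": james_cy_young_points(3, 2, 84, 40, 0, 15.0, False),
}

# Rank the candidates from most to fewest points.
for name, points in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
    print(name, points)
```

The same function can be called with partial-season totals to list the frontrunners at any point in the year.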

A useful addition to any Cy Young prediction formula is the advanced sabermetric statistic, Win Shares. Dave Studeman’s article explains the utility and significance of win shares. Calculating win shares is an extremely complicated process, but the result is a simple number which represents a player’s total contribution to a team’s wins.13 Win shares adjust for ballpark factors and clutch performances, reducing the possible error term and providing an accurate representation of a player’s performance instead of his talent.14 The addition of win shares would update the outdated Cy Young formula to include modern sabermetrics that are often considered by voters.

Theory

The purpose of this analysis is to construct an equation that can predict the Cy Young Award winner. The winner of the Cy Young is chosen based on weighted voting totals called Cy Young Award points. This equation should predict the Cy Young Award points of any pitcher by inputting their season statistics. The independent variables used in the estimate, their hypothesized signs, and an explanation of their signs are listed below.

CY = f(Win (+), Loss (-), ERA (-), K (+), SH (+), Saves (+), Team (+))

Equation 3: Estimated Cy Young Predictor

Where:

CY = Cy Young Award points at the end of the year (based on weighted voting totals)
Win = Wins recorded by pitcher
Loss = Losses recorded by pitcher
ERA = Earned run average recorded by pitcher (Earned Runs/Nine Innings Pitched)
K = Strikeouts recorded by pitcher in a given year
SH = Shutouts (starts with at least 5 innings in which opposing team does not score)
Saves = Relieved a pitcher with at most a three-run lead and held lead until end of game
Team = Dummy variable (1 represents a pitcher on a first-place team and 0 otherwise)

10 Bill James/Rob Neyer pg. 468
11 Bill James/Rob Neyer pg. 471
12 Bill James/Rob Neyer pg. 471
13 Dave Studeman pg. 1
14 Dave Studeman pg. 1

Theoretically, the following relationships are expected:

Win (+)

Win totals are a classical representation of a starting pitcher’s value and contribution to team success. Although there are many other factors that affect a pitcher’s win record, such as the team’s offensive ability, ballpark effects, opposing lineups, and luck, wins are still a popular determinant of ability. An increase in win total would correspond with more votes and therefore more Cy Young Award points, resulting in a positive sign.

Loss (-)

Loss totals are a classical representation of a starting pitcher’s failure and inability to contribute to a team’s success. As with win totals, there are many other factors which affect a pitcher’s loss record. An increase in loss total would correspond with fewer Cy Young Award points, resulting in a negative sign.

ERA (-)

A pitcher’s earned run average is the number of earned runs allowed per nine innings pitched. This represents the number of runs a pitcher would allow if he were to pitch a complete ballgame. Therefore, a lower ERA is preferable. The earned run average is a good indicator of a pitcher’s ability because it removes the factor of his team’s offensive ability, and concentrates on the pitcher’s ability to stop the opposing team from scoring runs. An increase in ERA would correspond with fewer Cy Young Award points, resulting in a negative sign.

K (+)

High strikeout totals are a good indicator of a pitcher’s skill because they represent a pitcher’s ability independent of the quality of his defense or the dimensions of the ballpark. Strikeouts are highly valued because they do not allow runners to advance and remove the possibility of fielding error. An increase in strikeout total would correspond with more Cy Young Award points, resulting in a positive sign.

SH (+)

Shutout totals represent the number of dominant performances in which the pitcher is able to stop the opposing team from scoring for an entire game. The player must pitch at least five innings, but is not required to pitch the entire game. These accomplishments denote a large contribution to the team’s success on those days and symbolize a pitcher’s potential and ability. An increase in shutout total would correspond with more Cy Young Award points, resulting in a positive sign.

Saves (+)

Save totals represent a player’s ability in late-game, pressure situations. Saves are usually only accrued by closers, who are groomed to pitch only in the final inning, and only if their team holds a lead of at most three runs. The final inning is just as significant as every other inning, so saves are sometimes believed to be overvalued. The value placed on closers and saves is probably a result of a human psychological fear of a last-second comeback. Nevertheless, saves are one of the few classical statistics available for valuing relief pitchers. An increase in save total would correspond with more Cy Young Award points, resulting in a positive sign.

Team (+)

Baseball is a team game, and a pitcher’s ability to contribute to a winning team is taken into consideration by Cy Young Award voters. If a player is on a first-place team, he will receive more Cy Young Award points, resulting in a positive sign.

Data Overview

If Cy Young points are used as the dependent variable, then the sample must include only pitchers who have received votes. In order to obtain a sizeable sample while only using pitchers with Cy Young points, pitchers from different years must be included in the data. Recent Cy Young data is more complete, with a breakdown of allocated points as well as vote tallies for all those who received votes. Therefore, data from 2005-2008 were used.

Figures 1 and 2 present a breakdown of the Cy Young Award points over the years covered by the data set.

Figure 1 Cy Young Award Points

Average Points: First Place      127.25
Average Points: Second Place     69.00
Average Margin of Victory        58.25

Figure 2: Average Cy Young Points

The breakdown displays an average of 69 points for second-place finishers, but this value does not represent the average total that a winner must overcome in order to win the award. As an equal number of Cy Young Award points is distributed each year, a lower first-place total would probably correspond with a higher second-place total. This pattern appears in 2007, and the inverse appears in 2006. Therefore, the second-place total usually has a negative correlation with the first-place total.

In order to evaluate pitchers, statistics from 2005-2008 were used. As there is an award given for each league, the data will only be taken from the American League. The following graphs display two popular statistics for predicting the Cy Young.

Figure 3: Cy Young Points by Strikeouts

Figure 4: Cy Young Points by Wins


Figures 3 and 4 do not display a visible correlation between Cy Young points and either wins or strikeouts. The inclusion of relief pitchers skews the data for wins and strikeouts because relievers do not pitch as many innings as starters and are less likely to accumulate such large totals.

Correlations and Descriptions

Figure 5 displays the plots of the correlations between the variables.

Figure 5: Plotted Variable Correlations

There are not many strong correlations which can be observed in Figure 5. There seems to be a positive correlation between Win and Loss and between Loss and ERA, but a correlation table would need to be examined in order to verify these assumptions.


Figure 6 displays the correlation coefficients and P-values associated with all the variables present in the equation.

(Each cell shows the correlation coefficient, with its P-value on the line beneath.)

        CY        Win       Loss      ERA       K         SH        Save      Team
CY      1.00000   0.36561   -0.10882  -0.09745  0.32576   0.37427   -0.11146  0.33724
                  0.0511    0.5742    0.6150    0.0846    0.0455    0.5649    0.0736
Win     0.36561   1.00000   0.42462   0.72602   0.51068   0.49262   -0.88722  0.03182
        0.0511              0.0217    <.0001    0.0046    0.0066    <.0001    0.8698
Loss    -0.10882  0.42462   1.00000   0.53583   0.38093   0.32549   -0.56881  -0.08649
        0.5742    0.0217              0.0027    0.0415    0.0849    0.0013    0.6555
ERA     -0.09745  0.72602   0.53583   1.00000   0.34635   0.44069   -0.80945  -0.10349
        0.6150    <.0001    0.0027              0.0657    0.0167    <.0001    0.5932
K       0.32576   0.51068   0.38093   0.34635   1.00000   0.36259   -0.59667  -0.08716
        0.0846    0.0046    0.0415    0.0657              0.0532    0.0006    0.6530
SH      0.37427   0.49262   0.32549   0.44069   0.36259   1.00000   -0.45365  0.30012
        0.0455    0.0066    0.0849    0.0167    0.0532              0.0134    0.1137
Save    -0.11146  -0.88722  -0.56881  -0.80945  -0.59667  -0.45365  1.00000   0.15495
        0.5649    <.0001    0.0013    <.0001    0.0006    0.0134              0.4222
Team    0.33724   0.03182   -0.08649  -0.10349  -0.08716  0.30012   0.15495   1.00000
        0.0736    0.8698    0.6555    0.5932    0.6530    0.1137    0.4222

Figure 6: Correlation Table

None of the independent variables are highly correlated with the dependent variable, CY. Additionally, many of the P-values are high, indicating that many of these correlations are not statistically significant. This problem can possibly be remedied by increasing the sample size so as to include more years.

Win has a high correlation with many other independent variables. This is understandable considering the overlap among most baseball statistics, and it represents a high possibility of multicollinearity. Unfortunately, wins are weighed too heavily by voters to be eliminated from the equation. This is reflected in the correlation between Win and CY, which has one of the highest coefficients and one of the lowest P-values among the independent variables.
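The link between a correlation coefficient and its significance can be sketched with the standard t statistic for a Pearson correlation, t = r·sqrt(n − 2)/sqrt(1 − r²). The sample size of 29 is an assumption taken from the Durbin-Watson table later in the paper (n = 29), and the function name is illustrative.

```python
import math


def correlation_t_stat(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


N_OBS = 29  # assumed sample size, taken from the n reported in Figure 7

# Correlations with CY taken from Figure 6.
for name, r in [("Win", 0.36561), ("SH", 0.37427), ("ERA", -0.09745)]:
    print(name, round(correlation_t_stat(r, N_OBS), 2))
```

Under that assumption, Win gives a t statistic of roughly 2.04, just below the two-sided 5% critical value of about 2.05 for 27 degrees of freedom, which agrees with its reported P-value of 0.0511.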

There are a few issues with the data which can be easily explained, but are more difficult to remedy. Many of the correlations between independent variables do not have the sign which would be expected. According to Figure 6, Win has a strong positive correlation with ERA. Theoretically, pitchers with lower ERAs should record higher win totals. This would suggest a hypothesized negative correlation between these independent variables. This can be explained by the presence of relievers in the data set. Relievers consistently record the lowest ERA totals, but do not record wins because of their role. Their value is derived from saves as opposed to win totals. This presents an issue within the data set, but fortunately does not affect the relationships between the dependent and independent variables.


Equations and Tests

The equation with calculated coefficients and t-values is displayed below.

Equation 4

CY = -61.68 + 8.68Win – 1.87Loss – 31.14ERA + 0.26K + 4.42SH + 1.68Save + 7.66Team

t-values: Win (3.23), Loss (-0.68), ERA (-1.93), K (1.63), SH (1.52), Save (1.54), Team (0.50)

R² = 0.6043
Dof = 21

For a two-sided t-test with a 0.05 alpha level, the critical value for 21 degrees of freedom is 2.08. This results in insignificant values for almost all the independent variables, which suggests the presence of irrelevant variables. If the variables with the low t-values are eliminated, a more accurate equation can be estimated.

Equation 5

CY = 78.31 + 6.45Win – 53.04ERA + 5.76SH

t-values: Win (3.72), ERA (-4.00), SH (2.12)

R² = 0.5017
Dof = 25

Equation 5 has a new critical value of 2.06 at an alpha level of 0.05 when using a two-sided t-test. This equation is much better because all of the independent variables are statistically significant. Unfortunately, equation 5 will never predict a reliever winning the award because it does not include saves. Relievers did not score well in recent voting, but if the sample size were increased to include other years, the data for saves might become more viable.

Omitted and Irrelevant Variables

A possible issue with equation 5 is that the coefficients have changed significantly from their values in equation 4. This suggests that one of the variables should not have been omitted. An omitted relevant variable can cause specification bias, which will lead to inaccurate estimated coefficients.15 It violates Classical Assumption III, which “assumes that the explanatory variables are independent of the error term”.16 Specification bias caused by an omitted relevant variable can be tested by readmitting a variable and observing the change in the estimated coefficients. If the coefficients change by more than ten percent, then the variable was relevant and should be readmitted.

Theoretically, the most obvious choice for readmission is saves. Closers derive almost all of their value from their save totals, and if the variable is omitted, the equation would fail to explain why closers receive votes. In the original equation, both strikeouts and saves have t-values that are close to being statistically significant. With the omission of the irrelevant variables, Loss and Team, the coefficients of the readmitted variables could increase to be significant or close to significant.

15 Studenmund pg. 163
16 Studenmund pg. 164

Equation 6

CY = -79.17 + 9.52Win – 33.28ERA + 0.25K + 4.80SH + 2.05Save

t-values: Win (3.95), ERA (-2.15), K (1.61), SH (1.83), Save (2.12)

R² = 0.5909
Dof = 23

Equation 6 has a new critical value of 2.07 at an alpha level of 0.05 when using a two-sided t-test. The coefficients for Win and ERA have changed by more than 10%. This suggests that the readmitted variables, K and Save, were relevant omitted variables. The only issue with this equation is that some of the independent variables are still statistically insignificant at the current alpha level. Testing for an irrelevant variable and removing it might improve the t-values of the remaining variables. Theoretically, the most obvious irrelevant variable would be shutouts. Shutouts do not explain the value of closers, as closers never record a shutout. Shutouts do not strongly explain the value of starters either. If shutouts are removed from the equation, the remaining t-values should increase to a statistically significant level.
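The ten percent rule applied between equations 5 and 6 can be sketched as a comparison of the coefficients the two equations share. The helper names and the dictionary layout are illustrative; the coefficients are those reported for equations 5 and 6.

```python
def percent_change(old: float, new: float) -> float:
    """Absolute percentage change of a coefficient relative to its earlier value."""
    return abs(new - old) / abs(old) * 100


def changed_coefficients(before: dict, after: dict, threshold: float = 10.0) -> dict:
    """Shared coefficients whose estimate moved by more than `threshold` percent."""
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        change = percent_change(before[name], after[name])
        if change > threshold:
            changes[name] = round(change, 1)
    return changes


EQUATION_5 = {"Win": 6.45, "ERA": -53.04, "SH": 5.76}
EQUATION_6 = {"Win": 9.52, "ERA": -33.28, "K": 0.25, "SH": 4.80, "Save": 2.05}

print(changed_coefficients(EQUATION_5, EQUATION_6))
```

With these inputs the Win coefficient moves by roughly 48% and the ERA coefficient by roughly 37%, both well beyond the ten percent threshold.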

Equation 7

CY = -107.62 + 10.53Win – 27.92ERA + 0.30K + 2.27Save

t-values: Win (4.28), ERA (-1.75), K (1.91), Save (2.26)

R² = 0.5311
Dof = 22
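A minimal sketch of how equation 7 would be applied to a pitcher's season line is given below. The coefficients are those of equation 7; the function name and the sample pitcher are hypothetical.

```python
EQUATION_7 = {"Intercept": -107.62, "Win": 10.53, "ERA": -27.92, "K": 0.30, "Save": 2.27}


def predict_cy_points(stats: dict, coefficients: dict = EQUATION_7) -> float:
    """Estimated Cy Young Award points: the intercept plus each coefficient times its statistic."""
    total = coefficients["Intercept"]
    for name, coefficient in coefficients.items():
        if name != "Intercept":
            total += coefficient * stats[name]
    return total


# Hypothetical starter: 20 wins, 2.50 ERA, 200 strikeouts, no saves.
starter = {"Win": 20, "ERA": 2.50, "K": 200, "Save": 0}
print(round(predict_cy_points(starter), 2))  # 93.18
```

Running the same function over every candidate in a season and sorting by the result would give the predicted order of finish.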

Equation 7 has a critical value of 2.07 at an alpha level of 0.05 when using a two-sided t-test. The coefficients of the remaining variables have changed by roughly 11% to 20%, far less than the changes observed between equations 5 and 6. This suggests that the removed variable, SH, is irrelevant. Additionally, all the t-values are now closer to the critical value. There are still two values which are too low to be statistically significant. To remedy this problem, the critical value can be altered by conducting a one-sided test as opposed to the two-sided test. The use of a one-tailed test must be justified theoretically. In baseball, whether a statistic reflects positively or negatively on the contributions of a player is very clear if the statistic is properly understood. This suggests that only one tail needs to be considered and all of alpha can be used on that side.

A one-sided t-test would yield a critical value of 1.72 at an alpha level of 0.05. All of the t-values are significant when a one-tailed test is conducted.
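The difference between the two tests can be sketched by comparing the equation 7 t-values against each critical value. The function names are illustrative; the t-values, critical values, and hypothesized signs are those stated in the text.

```python
EQ7_T_VALUES = {"Win": 4.28, "ERA": -1.75, "K": 1.91, "Save": 2.26}
EXPECTED_SIGNS = {"Win": 1, "ERA": -1, "K": 1, "Save": 1}  # hypothesized signs from the Theory section


def significant_two_sided(t_values: dict, critical: float) -> list:
    """Variables whose t-value exceeds the critical value in absolute terms."""
    return [name for name, t in t_values.items() if abs(t) > critical]


def significant_one_sided(t_values: dict, expected_signs: dict, critical: float) -> list:
    """Variables whose t-value exceeds the critical value in the hypothesized direction."""
    return [name for name, t in t_values.items() if t * expected_signs[name] > critical]


print(significant_two_sided(EQ7_T_VALUES, 2.07))                  # ['Win', 'Save']
print(significant_one_sided(EQ7_T_VALUES, EXPECTED_SIGNS, 1.72))  # ['Win', 'ERA', 'K', 'Save']
```

The one-sided version only counts a t-value that lies in the hypothesized direction, which is why the test needs a theoretical justification for each sign.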

Multicollinearity


“Multicollinearity is a violation of the classical assumption that states that no independent variable is a perfect linear function of one or more independent variables”17. Multicollinearity could pose a problem with this data set due to the high correlation between baseball statistics. Many of the statistics used to evaluate pitching performance are indirectly related to one another. A starting pitcher with a low ERA will win many games for his team, while a relief pitcher who records high save totals will not record many wins. Additionally, the high R² values and low t-values suggest the presence of multicollinearity.

Consequences of Multicollinearity

1. “Estimates will remain unbiased”18.
2. “The variance and standard errors of the estimates will increase”19. Coefficients come from distributions with much larger variances and therefore larger standard errors.
3. Lower t-values20

In the data set, Win has a strong positive correlation with ERA and K. Save has a strong negative correlation with Win, ERA, and K. In order to test for multicollinearity, a VIF test must be performed. VIF is an acronym for variance inflation factor, and a score above five suggests multicollinearity.

Variable   R²       VIF
Win        0.6035   2.5221
ERA        0.5279   2.1182
K          0.2621   1.3552
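The VIF column follows from the R² column through the standard relationship VIF = 1/(1 − R²), where R² comes from regressing one independent variable on the others. A minimal sketch, with an illustrative function name and the threshold of five used in the text:

```python
def variance_inflation_factor(r_squared: float) -> float:
    """VIF for a variable, given the R-squared of its auxiliary regression on the other regressors."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in the range [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Auxiliary R-squared values reported in the VIF tables.
for name, r_squared in [("Win", 0.6035), ("ERA", 0.5279), ("K", 0.2621), ("Save", 0.8763)]:
    vif = variance_inflation_factor(r_squared)
    flag = "suggests multicollinearity" if vif > 5 else "acceptable"
    print(f"{name}: VIF = {vif:.4f} ({flag})")
```

Only the Save value of 0.8763 produces a VIF above five, which matches the reported figure of 8.0841.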

The only variable that showed strong multicollinearity after the VIF test was Save. There are three options for handling multicollinearity in a data set.

1. Do nothing: Multicollinearity will not always dramatically reduce t-values.
2. Drop the redundant variable.
3. Increase sample size: A larger sample will create more accurate estimates and lower the variance of the estimated coefficients. This would decrease the impact of multicollinearity.

Saves are too indicative of the value of closers and too influential on the decisions of Cy Young voters to be removed from the equation. Increasing the sample size might provide a good solution, but leaving the equation unchanged might also be a viable option. Multicollinearity is unavoidable when using a variable such as Saves. Relievers will always record lower ERAs, strikeouts, and win totals than starters. This will produce a strong negative correlation between saves and the other independent variables. There are also not many closers in the data set, because only one or two closers receive votes each year. This small sample of saves is inevitable, and far more than double the sample size would be required to remove the multicollinearity. Additionally, the presence of multicollinearity has not dramatically affected the t-values for equation 7. All variables in this equation are statistically significant. Increasing the sample size and eliminating the variable are not productive options, so this issue of multicollinearity will be ignored.

Variable   R²       VIF
Save       0.8763   8.0841
Win        0.7878   4.7125
ERA        0.6843   3.1676
K          0.4113   1.6987

17 Studenmund pg. 245
18 Studenmund pg. 250
19 Studenmund pg. 251
20 Studenmund pg. 251

Serial Correlation

“Pure serial correlation occurs when the classical assumption IV is violated which assumes uncorrelated observations of the error term.”21 A positive value for ρ “indicates that the error term tends to have the same sign from one time period to the next.”22 This would be positive serial correlation, while a negative ρ, under which the error term tends to switch signs, would be negative serial correlation.23 A ρ of zero would indicate that no serial correlation exists.24 Although the data were originally intended to form a time series, the sample size was too small, so data from different years were analyzed together to form a cross-sectional data set. Pure serial correlation is usually found within a time series data set, so it can be hypothesized that pure serial correlation will not be an issue with cross-sectional data. Alternatively, impure serial correlation is “caused by a specification error such as an omitted variable or an incorrect functional form.”25 The effect of omitting variables can change the error term and cause impure serial correlation.

Consequences of Serial Correlation

1. “Pure serial correlation does not cause bias in the coefficient estimates”26 Although there is no bias with pure serial correlation, impure serial correlation can produce biased coefficient estimates.27

2. “Serial correlation causes OLS estimates to no longer be the minimum variance estimators”28

3. “Serial correlation causes the OLS estimates of the SE(β̂)s to be biased, leading to unreliable hypothesis testing.”29 This can also produce unreliable t-values.30

A Durbin-Watson d test must be performed in order to test for serial correlation. The Durbin-Watson d test determines if there is serial correlation “in the error term of an equation by examining the residuals of a particular estimator of that equation.”31 Figure 7 displays the “du” and “dl” for the estimated equation based on the number of variables and the sample size. Figure 8 is used to interpret the Durbin-Watson d statistic based on the values of “du” and “dl”.

21 Studenmund pg. 314
22 Studenmund pg. 315
23 Studenmund pg. 315
24 Studenmund pg. 315
25 Studenmund pg. 317
26 Studenmund pg. 322
27 Studenmund pg. 323
28 Studenmund pg. 322
29 Studenmund pg. 322
30 Studenmund pg. 323
31 Studenmund pg. 325

Figure 7 Durbin Watson d Statistic Calculation

The critical d values are dL = 1.12 and dU = 1.74.

H0: p ≤ 0

HA: p > 0

If d < 1.12: Reject H0

If d > 1.74: Do Not Reject H0

If 1.12 ≤ d ≤ 1.74: Inconclusive

Range of d        Interpretation
0 to dl           SC
dl to du          Inconclusive
du to 2           No SC
2 to 4-du         No SC
4-du to 4-dl      Inconclusive
4-dl to 4         SC

Figure 8: Durbin Watson d Statistic Interpretation Tool

The Durbin-Watson d statistic for the estimated equation is 1.663. Using Figure 8, the Durbin-Watson d statistic is inconclusive because it falls between 1.12 and 1.74, so it is not clear whether serial correlation is present.
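The decision rule in Figure 8 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the toy residuals are invented to show the calculation, and the only values taken from this paper are d = 1.663, dL = 1.12 and dU = 1.74.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def interpret_dw(d, dl, du):
    """Map a d statistic onto the regions of Figure 8."""
    if d < dl:
        return "positive serial correlation"
    if d <= du:
        return "inconclusive"
    if d < 4 - du:
        return "no serial correlation"
    if d <= 4 - dl:
        return "inconclusive"
    return "negative serial correlation"


# Value reported for the estimated equation in this paper.
print(interpret_dw(1.663, dl=1.12, du=1.74))

# Invented residuals that alternate in sign, only to show the calculation.
toy_d = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(toy_d, interpret_dw(toy_d, dl=1.12, du=1.74))
```

Residuals that alternate in sign push d toward 4, while residuals that keep the same sign push it toward 0, which is why the two ends of Figure 8 both indicate serial correlation.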

Heteroskedasticity

“Heteroskedasticity is the violation of Classical Assumption V, which states that the observations of the error term are drawn from a distribution that has a constant variance.”32 Theoretically, this data set contains an indicator that would suggest that heteroskedasticity is not present. Heteroskedasticity is more common in data sets where there are large variances in the observations of the dependent variable.33 In such data sets, the observations of the error term will no longer have a constant variance. In this data set, the minimum value of the dependent variable observations is constant at one. When comparing the different years, the maximum value of the observations varies only from 115 to 140. The data should still be tested for heteroskedasticity because it is very commonly found in cross-sectional data.34

32 Studenmund pg. 346
33 Studenmund pg. 355
34 Studenmund pg. 346

Figure 7 values:

k    n     dL      dU
4    29    1.12    1.74

The residuals can be plotted, and signs of heteroskedasticity can be observed through their relationship. An expanding range of residuals indicates the presence of heteroskedasticity.35

Figure 9 Plot of Residuals

Figure 9 does not display an expanding range of residuals. This would suggest that there is no heteroskedasticity present, but a Park Test can also be conducted to test the data. In the Park Test, if β2 ≠ 0, heteroskedasticity exists in the equation.

Park Test: ln(ei²) = β1 + β2 ln(Xi) + Vi

H0: β2 = 0

HA: β2 ≠ 0

Figure 10: Park Test Results

A p-value above 0.05 indicates that a coefficient is not statistically significant at the 5% level. As all of the p-values in Figure 10 are above 0.05, the null hypothesis cannot be rejected. It can then be concluded that, according to the Park Test, there is no evidence of heteroskedasticity in the data set.
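The Park Test steps can be sketched as follows. This is a minimal illustration rather than the procedure used to produce Figure 10: the function name park_regression and the toy residuals are assumptions, and only the four p-values are taken from Figure 10.

```python
import math


def park_regression(residuals, x):
    """OLS fit of ln(e_i^2) = b1 + b2 * ln(X_i); returns (b1, b2)."""
    ln_e2 = [math.log(e * e) for e in residuals]
    ln_x = [math.log(v) for v in x]
    n = len(ln_x)
    mean_x = sum(ln_x) / n
    mean_y = sum(ln_e2) / n
    sxx = sum((a - mean_x) ** 2 for a in ln_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(ln_x, ln_e2))
    b2 = sxy / sxx
    b1 = mean_y - b2 * mean_x
    return b1, b2


# Invented residuals whose size grows with X, so the slope b2 is far from zero.
b1, b2 = park_regression(residuals=[1.0, -2.0, 3.0, -4.0], x=[1.0, 2.0, 3.0, 4.0])
print(round(b1, 6), round(b2, 6))

# Decision rule applied to the p-values reported in Figure 10.
ALPHA = 0.05
p_values = {"Win": 0.59, "ERA": 0.86, "K": 0.48, "Save": 0.81}
significant = [name for name, p in p_values.items() if p < ALPHA]
print(significant)  # an empty list means H0 is not rejected for any variable
```

In the toy data the residuals are proportional to X, so the fitted slope is 2, which is the pattern the test is designed to detect. None of the reported p-values falls below 0.05, so no variable shows that pattern here.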

35 Studenmund pg. 255

Figure 10 values:

Variable    P-Value
Win         0.59
ERA         0.86
K           0.48
Save        0.81

Alternative Testing

An appropriate test of the functionality of equation 7 would be to input data from other years and see if the equation correctly predicts the winner. Data from the American League between 2001 and 2004 will be used for this test.

CY = -107.62 + 10.53Win – 27.92ERA + 0.30K + 2.27Save
                (4.28)     (-1.75)   (1.91)  (2.26)

R2 = 0.5311

Dof = 22

Figure 11 displays the data for the top four actual winners and their Cy Young point totals as well as the estimated top four winners, calculated using the statistics from that year.

2001    First Place             Second Place             Third Place             Fourth Place
Yi      Roger Clemens 122.0     Mark Mulder 60.0         Freddy Garcia 55.0      Jamie Moyer 12.0
Y^i     Roger Clemens 176.5     Mark Mulder 170.7        Mike Mussina 155.3      Freddy Garcia 153.3

2002    First Place             Second Place             Third Place             Fourth Place
Yi      Barry Zito 114.0        Pedro Martinez 96.0      Derek Lowe 41.0         Jarrod Washburn 1.0
Y^i     Barry Zito 220.0        Pedro Martinez 219.2     Derek Lowe 187.2        Jarrod Washburn 143.3

2003    First Place             Second Place             Third Place             Fourth Place
Yi      Roy Halladay 136.0      Esteban Loaiza 63.0      Pedro Martinez 20.0     Tim Hudson 15.0
Y^i     Esteban Loaiza 202.3    Roy Halladay 202.1       Pedro Martinez 147.2    Tim Hudson 141.7

2004    First Place             Second Place             Third Place             Fourth Place
Yi      Johan Santana 140.0     Curt Schilling 82.0      Mariano Rivera 27.0     Pedro Martinez 1.0
Y^i     Johan Santana 217.2     Curt Schilling 191.0     Mariano Rivera 128.1    Pedro Martinez 127.7

Figure 11: Observed and Estimated Cy Young Results from 2001-2004

The estimated point totals should not be compared directly with the actual point totals. This is because the actual point totals are skewed. As there are a limited number of points to allocate each year, the actual individual point totals are influenced by the other candidates. Conversely, the estimated point totals are only based on each individual’s performance. This results in a smaller variance in point totals for the estimated results. The estimated point totals are very useful because, unlike the actual point totals, they can be used to compare winners from different years. For example, when all of the pitching performances across the four years are compared, the equation shows that the best season was Barry Zito’s in 2002. With a larger sample size, this equation could identify the most dominant pitching performance of all time.

The estimated equation predicts the first place winner with 75% accuracy when tested with four different years. It also predicts all place finishers up to fourth place with 75% accuracy. This is a positive indication that the variables were appropriately hypothesized and that their signs and coefficients were correctly estimated. The only year in which the winner was incorrectly estimated was 2003. Esteban Loaiza was estimated to win with a point total of 202.3 over Roy Halladay with an estimated point total of 202.1. This error was very slight considering the estimated point total of Halladay was only 0.1% lower than the estimated point total of Loaiza.
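The accuracy figures above can be checked directly from the finishing orders in Figure 11. The sketch below is only an illustration: the variable names are hypothetical, and the names and point totals are transcribed from Figure 11.

```python
# Finishing orders (first to fourth place) transcribed from Figure 11.
actual = {
    2001: ["Roger Clemens", "Mark Mulder", "Freddy Garcia", "Jamie Moyer"],
    2002: ["Barry Zito", "Pedro Martinez", "Derek Lowe", "Jarrod Washburn"],
    2003: ["Roy Halladay", "Esteban Loaiza", "Pedro Martinez", "Tim Hudson"],
    2004: ["Johan Santana", "Curt Schilling", "Mariano Rivera", "Pedro Martinez"],
}
estimated = {
    2001: ["Roger Clemens", "Mark Mulder", "Mike Mussina", "Freddy Garcia"],
    2002: ["Barry Zito", "Pedro Martinez", "Derek Lowe", "Jarrod Washburn"],
    2003: ["Esteban Loaiza", "Roy Halladay", "Pedro Martinez", "Tim Hudson"],
    2004: ["Johan Santana", "Curt Schilling", "Mariano Rivera", "Pedro Martinez"],
}

years = sorted(actual)

# Share of years in which the estimated winner matches the actual winner.
winner_hits = sum(actual[y][0] == estimated[y][0] for y in years)
winner_accuracy = winner_hits / len(years)

# Share of all first-to-fourth places filled by the correct pitcher.
place_hits = sum(
    a == e for y in years for a, e in zip(actual[y], estimated[y])
)
place_accuracy = place_hits / (4 * len(years))

# Relative gap between the 2003 estimates for Loaiza (202.3) and Halladay (202.1).
gap_pct = (202.3 - 202.1) / 202.3 * 100

print(winner_accuracy, place_accuracy, round(gap_pct, 1))
```

The winner is matched in three of the four years, and twelve of the sixteen places are filled by the correct pitcher, which gives 75% on both measures.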

Errors

There are several sources of error in the data set which have not yet been addressed in the analysis.

1. The dependent variable, Cy Young points, is dependent on the competition from other candidates in a given year.

2. Relievers and starters derive their value from different statistics.

3. Some statistics were omitted because they only have a relevant contribution to the value of the dependent variable if an extraordinary number of them are recorded.

There are a limited number of Cy Young points which can be allocated among pitchers each year. This results in very good pitching performances receiving a very low point total simply because many other pitchers also performed well in that year. In another year, that very same performance might have received a much higher point total. This will affect the accuracy of the data because the value of the dependent variable for each observation is partially dependent on the value of the dependent variable of other observations.

The fact that starters and relievers derive their value from different statistics has been a problem throughout the analysis. One example of this issue can be found by observing the correlation between the independent variables, Win and ERA. From a theoretical perspective, a lower ERA should result in more wins. This would result in a strong negative correlation. The data is skewed because of the inclusion of relievers, who will have the lowest ERAs but also the lowest win totals. It is possible that the inclusion of relievers has affected the accuracy of the coefficients so that relievers are undervalued because of a lack of high win or strikeout totals. This may also benefit the accuracy of the estimated equation because it could reflect an actual preference for starters among Cy Young voters. This preference is supported by the fact that only four relievers have won the Cy Young award in the American League since its introduction in 1956.36

36 Baseballreference.com


One of the main issues with the final equation is that it has a low R2 value. An R2 of 0.53 signifies that the independent variables explain only 53% of the variance of the dependent variable. The R2 fell slightly every time a variable was removed for being statistically insignificant or irrelevant. The remaining unexplained variance can be attributed to minor statistical categories which are not always considered in Cy Young voting. These minor statistics become relevant only when a pitcher records a significant total. For example, if a pitcher records a significant number of complete games, this statistic might be considered in the voting, but not otherwise. These statistics are therefore difficult to use in a linear equation that estimates the Cy Young winner in any year.

Conclusion and Results

The purpose of this analysis was to construct an equation that could estimate the Cy Young Award winner. Although originally intended to be used with final season statistics, this equation could be used to reveal the current leader in the Cy Young race at any point in the season. The current American League Cy Young race is displayed in Figure 12, using data from the beginning of the season through 5/7/09. These results were calculated by inputting the season statistics as of 5/7/09 into the estimated equation. Statistics from the top three leaders in each category that corresponded to an independent variable were used as data.

2009    First Place           Second Place             Third Place            Fourth Place
Y^i     Zack Greinke -39.4    Frank Francisco -72.8    Roy Halladay -123.1    Jonathan Papelbon -129.7

Figure 12: Prediction of 2009 Cy Young Winner Based on Season Statistics as of May 7th 2009

The estimated point totals are negative because the equation is based on final season statistics. At this point in the season, the leaders can be identified by their proximity to zero. As the season progresses and they accumulate statistics with higher numerical values, their estimated point total will increase in the positive direction.
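The effect of partial-season statistics on the estimated point total can be illustrated with a short sketch of the estimated equation. The coefficients are those of the estimated equation in this paper; the function name cy_points and both stat lines are hypothetical inputs chosen for illustration, not data from the analysis.

```python
def cy_points(wins, era, strikeouts, saves):
    """Estimated Cy Young points from the paper's estimated equation."""
    return -107.62 + 10.53 * wins - 27.92 * era + 0.30 * strikeouts + 2.27 * saves


# Hypothetical early-season line for a starter: few wins and strikeouts so far.
early = cy_points(wins=6, era=0.40, strikeouts=54, saves=0)

# Hypothetical full-season line for a starter.
full = cy_points(wins=20, era=3.00, strikeouts=200, saves=0)

print(round(early, 1))  # -39.4
print(round(full, 1))   # 79.2
```

Because the intercept is -107.62, a pitcher needs to accumulate enough wins, strikeouts, or saves to offset it before the estimate turns positive, even with a very low ERA.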

The final equation is very comparable to the Cy Young predictor which was created by Bill James. His equation has been rewritten in equation 8, so that the variable nomenclature matches that of the estimated equation.

Equation 8: Bill James Cy Young Predictor

CY = 6Win – 2Loss + 0.08K + 2.5Save + SH + R + 12Team

Equation 9: Estimated Cy Young Predictor

CY = -107.62 + 10.53Win – 27.92ERA + 0.30K + 2.27Save
                (4.28)     (-1.75)   (1.91)  (2.26)

R2 = 0.5311

Dof = 22


There are some interesting differences between the two equations. Instead of using ERA, James uses runs saved, denoted by R in his equation. Runs saved is a very effective replacement because the statistic adjusts based on competition. This will result in more accurate predictions because actual Cy Young points are based on the level of competition that year. This is an important element that was not captured in the estimated equation.

James uses lower coefficients for wins and strikeouts and includes a variable for losses. This would increase the estimated value of reliever statistics because relievers record fewer wins, strikeouts, and losses. This would result in higher estimated Cy Young point totals for relievers. This is balanced by the use of runs saved, which increases the estimated value of starters when compared to the use of ERA. Starters record higher totals of runs saved because they pitch a greater number of innings. In comparison, the use of ERA would favor relievers, who usually record lower averages.

Both equations have excellent success rates when predicting the Cy Young winner. While James’ equation has a 6% higher success rate, he does not specify its ability to predict the runners-up accurately. In order to make an appropriate comparison, the estimated equation must be tested with the same range of data. If data were tested for every year since the introduction of the award in 1956, the success rate of the estimated equation might be closer to that of James’ equation.


Bibliography

Jim Albert. “An Introduction to Sabermetrics” http://www-math.bgsu.edu/~albert/papers/saber.html

Baseball Reference.com, http://www.baseball-reference.com/awards/mvp_cya.shtml

Bill James and Rob Neyer. The Neyer/James Guide to Pitchers. New York: Fireside, 2003

Ben McGrath. “The Professor of Baseball”. New York: The New Yorker, 2003 http://www.newyorker.com/archive/2003/07/14/030714fa_fact1

Dave Studeman. “2004 Win Shares Have Arrived”. The Hardball Times, 2004 http://www.hardballtimes.com/main/article/2004-win-shares-have-arrived

A.H. Studenmund. Using Econometrics. Pearson Education Inc. 2006
