Predictive Model – The Evolution of the Major League Baseball Contract

The final step in analyzing contracts is to use statistics for prediction. Using the statistical software SAS, I attempted to build linear regression models that predict the money a player might receive.

As we saw on the “Data and Relationships” page, there is a big difference between contracts handed out in the seventies and eighties and those earned in the twenty-first century. The difference is so big that I didn’t think it was feasible to construct a single linear model spanning 1977 to 2023. Instead, I split the time into three chunks of roughly sixteen years each (1977-1992, 1993-2008, and the slightly shorter 2009-2023). Each model includes the predictors that SAS deemed significant at the alpha level stated for that model. Any predictor not in a model was not significant at that level.
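To make the era split concrete, here is a minimal Python sketch of fitting a separate line to each time frame. It is not the SAS procedure used for the models on this page: it uses a single predictor instead of several, does no significance testing, and runs on made-up rows. The era boundaries come from the text; the names `era_for`, `fit_line`, and `fit_by_era` and all the numbers in `toy` are assumptions for illustration only.

```python
# Era boundaries taken from the text; everything else is illustrative.
ERAS = [(1977, 1992), (1993, 2008), (2009, 2023)]


def era_for(year):
    """Return the (start, end) era containing a free agent year, or None."""
    for start, end in ERAS:
        if start <= year <= end:
            return (start, end)
    return None


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_era(records):
    """Group (year, predictor, dollars) rows by era and fit one line per era."""
    grouped = {}
    for year, x, y in records:
        era = era_for(year)
        if era is not None:
            xs, ys = grouped.setdefault(era, ([], []))
            xs.append(x)
            ys.append(y)
    return {era: fit_line(xs, ys) for era, (xs, ys) in grouped.items()}


# Invented rows, NOT real contracts: (free agent year, WAR3, total $ millions).
toy = [
    (1980, 5.0, 1.0), (1985, 10.0, 2.0), (1990, 15.0, 3.0),
    (2010, 5.0, 20.0), (2015, 10.0, 40.0), (2020, 15.0, 60.0),
]
models = fit_by_era(toy)
for era, (intercept, slope) in sorted(models.items()):
    print(era, "intercept:", round(intercept, 3), "slope:", round(slope, 3))
```

In this toy data the same WAR3 is worth far more per unit in the later era, which is the kind of shift that makes one line across all the years a poor fit.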

1977-1992

Hitters

Pitchers

In the Hitters model, we see that age, free agent year, WAR3, OPS, and RBI in the contract year were all significant predictors for the contract a hitter would receive. All these predictors were significant at an alpha level of 0.05. The adjusted R^2 value for the model is 0.3906, which means that about 39% of the variability in the data can be explained by the model.

The Pitchers model contains age, free agent year, WAR3, contract year WAR, and contract year home runs per nine innings. All were significant at an alpha level of 0.15. The model has an adjusted R^2 value of 0.6598, meaning that roughly two-thirds of the variability in the data can be explained by the model.
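For readers unfamiliar with adjusted R^2, the sketch below shows how it is computed from the ordinary R^2 using the standard formula, which penalizes a model for each added predictor. The function names are my own, and the sample size and predictor count in the example are hypothetical, since the number of observations behind each model is not listed on this page.

```python
def r_squared(actual, predicted):
    """Share of the variability in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs: R^2 of 0.5 from 100 contracts and 3 predictors.
example = adjusted_r_squared(0.5, n_obs=100, n_predictors=3)
print(example)  # 0.484375
```

Because of the penalty, the adjusted value is always a little below the raw R^2, and the gap widens as predictors are added or the sample shrinks.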

1993-2008

Hitters

Pitchers

For this time frame, the Hitters model contains the predictors age, free agent year, WAR3, career batting average, and contract year home runs. The model also includes an indicator variable for whether the player is a shortstop, so we conclude that shortstops received significantly different contracts from the rest of the hitters. These predictors were significant at an alpha level of 0.01, and the model carries an adjusted R^2 value of 0.6165.

The Pitchers model found only three predictors to be significant: age, free agent year, and WAR3. All were significant at an alpha level below 0.0001. None of the other variables could be deemed significant even after bumping the significance level up to 0.2. With so few predictors, the adjusted R^2 value is 0.5086, meaning that the model only accounts for about 50% of the variability in the data.

2009-2023

Hitters

Pitchers

The Hitters model for 2009-2023 again contains the indicator variable for whether a player is a shortstop, suggesting that shortstops earn a significantly different amount of money during free agency. The model also includes age, free agent year, WAR3, career OBP, hits in the contract year, and OPS in the contract year. All these predictors are significant under the 0.01 significance level. The model has an adjusted R^2 value of 0.7341. In other words, the model accounts for about 73% of the data’s variability.

A look at the Pitchers model shows a very interesting fact: free agent year, present in every other model, is excluded. This suggests that, once the other predictors are accounted for, contracts for pitchers have not significantly changed over the fifteen-year period. The model does include career WHIP, WAR3, wins in the contract year, and strikeouts and walks per nine innings in the contract year. The model explains 82.5% of the variability in the data, and all the predictors were significant at an alpha level of 0.1.

Judging by adjusted R^2 values, these models have differing levels of success. The Pitchers model for 2009-2023 is solid: a 0.825 adjusted R^2 value is strong, especially for modeling a situation as complicated as this one. On the other hand, the models with adjusted R^2 values of 0.5086 and 0.3906 are simply not good at modeling contract amounts. They are useful for seeing how changing stats might change the outcome, but as predictors of an actual dollar total, they are unreliable.

Why might this be? Three reasons come to mind.

Contract Length. One of the biggest driving forces behind the total dollar amount of a contract is the length of the deal. Think about it. A player who signs a 1-year, $20 million deal is most likely better than one who inks a 3-year, $21 million deal. The first player is making $20 million per season while the other earns $7 million. However, without accounting for length, it seems like the latter player is better because his deal is worth $1 million more. Unfortunately, contract length cannot serve as a predictor variable. Just like the total pay, it is an outcome of a player’s performance and age. Heck, I could do this whole thing again, only focusing on length instead of money. So while contract length no doubt contributes to the pay, it cannot be accounted for in a predictive model.
Incomplete Data. While I believe that my method of selecting observations was the best one available, it wasn’t perfect. The reality is that my data were a minuscule sample of all the contracts out there, and including every single one would be ideal. That’s a tall task, and a bit impractical for an eight-week period. I surmise that my method of selecting the top 10 free agents by WAR3 overrepresented the bigger free agent deals and underrepresented the less noteworthy ones. I also excluded any players who went from the big leagues to signing a minor league contract. It’s difficult to quantify those minor league deals, especially when bonuses or incentives are factored in. For more on the collection of data, visit the Methodology section.
Lack of independence among the predictors. All the stats are different ways of reflecting the same outcome. For instance, say Babe Ruth hits a double. That means Ruth’s hits total increases by one, and so does his doubles total. His at-bats increase by one, as do his plate appearances. His batting average, on-base percentage, and slugging percentage all jump, and therefore his on-base plus slugging (OPS) rises too. He increases his WAR, and many other statistics change as well. This all happens from one double. My point is that many of the metrics relied on to build a model are strongly influenced by other metrics. WAR has a (really complicated) formula, but at the end of the day, it all comes down to the basic counting stats like hits, strikeouts, walks, etc. OPS is literally the sum of two other rate stats. When choosing the predictors for the model, I tried to eliminate as much of this issue, known as multicollinearity, as possible. I didn’t include any career counting stats like games played, hits, or wins, because those are all dependent on age. OPS and OPS+ are extremely similar, so OPS+ was left out. Total bases is simply a function of hits, doubles, triples, and homers, so it didn’t make the cut either. But it’s impossible to fully get rid of this multicollinearity problem, and so the perfect model this is not.
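The double example can be traced in a few lines of Python. This is only a sketch using the standard definitions of batting average, on-base percentage, slugging percentage, and OPS. The `BattingLine` class, the `add_double` helper, and the starting stat line are invented for illustration and do not come from my data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BattingLine:
    """Counting stats for one hitter (hypothetical container)."""
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int = 0
    sac_flies: int = 0

    @property
    def total_bases(self):
        singles = self.hits - self.doubles - self.triples - self.home_runs
        return singles + 2 * self.doubles + 3 * self.triples + 4 * self.home_runs

    @property
    def avg(self):
        return self.hits / self.at_bats

    @property
    def obp(self):
        reached = self.hits + self.walks + self.hit_by_pitch
        chances = self.at_bats + self.walks + self.hit_by_pitch + self.sac_flies
        return reached / chances

    @property
    def slg(self):
        return self.total_bases / self.at_bats

    @property
    def ops(self):
        # OPS is literally OBP + SLG, so it can never move independently.
        return self.obp + self.slg


def add_double(line):
    """Return a new stat line after one more at-bat that ends in a double."""
    return replace(line, at_bats=line.at_bats + 1,
                   hits=line.hits + 1, doubles=line.doubles + 1)


# Made-up starting line, not a real player's season.
before = BattingLine(at_bats=100, hits=30, doubles=5, triples=1,
                     home_runs=4, walks=10)
after = add_double(before)

stats = ("hits", "doubles", "total_bases", "avg", "obp", "slg", "ops")
changed = [s for s in stats if getattr(after, s) != getattr(before, s)]
print(changed)
```

Every stat in the list moves after a single swing, which is why putting several of them in the same regression gives the model overlapping information.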

The bottom line is that it’s tough to build a model when all the predictors hinge on each other like they do in this case. I am convinced, however, that we can still learn things from these models, and with continued work, they can become better predictors of free agency.