The following is adapted from an analysis that Chris Bull (MBA ’18), Nehal Mehta (MBA ’18), and I completed as part of MIT Sloan’s Analytics Edge course. Thanks to them for their help with the analyses, presentation, and write-up.
The MLB free agency problem—how teams effectively target the right set of players on the open market for the right price—is an extremely important and costly one. In 2017, the total contract value of free agents signed was greater than $1.4 billion, a staggering number that actually represents a decrease from the $2.4 billion of total contract value signed in 2016. These contract values represent a per team average of more than $45 million in 2017 and $80 million in 2016, a significant annual investment made by each team to acquire player services. Teams incur considerable additional expense to evaluate player talent in hopes of uncovering players they can target in free agency for below-market rates. As such, acquiring the right mix of talent at the right value is an important and costly task that has critical implications for the success or failure of each franchise.
The Greater Goal
Given the above, we leveraged historical player performance, biographical, and salary data to predict what contract value a player will command on the free agent market and to identify the factors driving that value. In particular, we focused our analysis on pitchers because our hypothesis was that the market has historically overvalued certain performance measures for pitchers that do not reflect the player’s true contribution to a team’s success (i.e., wins), and therefore the pitching market has some degree of inefficiency. With insight into what drives free agent contract values, a team can then identify and target future free agents who might be undervalued by the market, given this bias towards inappropriate measures.
Our primary player performance data source was the Sean Lahman Baseball Database, which is widely regarded as one of the most comprehensive and reputable publicly available data sources in sports; Lahman incorporates baseball statistics dating back to the late 19th century on everything from individual performance measures to end-of-season awards voting. Free agent contract data was compiled from three sources: for 2006-2010, the data was pulled from ESPN; for 2011, from MLB Trade Rumors; and for 2012-2017, from SPOTRAC. Finally, opening day payroll information was pulled from The Baseball Cube.
In the Lahman Database, players are assigned a unique identifier, which normally takes the form “first five letters of last name” + “first two letters of first name” + identifier number (which differentiates players with similar names). Using this identifier, we were able to map salary information and performance statistics to the free agent data, creating a single, comprehensive dataset which would form the base of our analysis.
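As an illustration, the naming convention above can be sketched as a simple helper. This function is our own approximation for matching purposes; the real Lahman identifiers include exceptions (short names, collisions), so it should be treated as a fuzzy-matching heuristic rather than a ground-truth key:

```python
def lahman_id(last, first, seq=1):
    """Approximate a Lahman player ID from a name, e.g. 'greinza01'.

    Mirrors the stated convention only; real IDs contain exceptions,
    so use this as a matching heuristic, not a definitive key."""
    return (last[:5] + first[:2]).lower() + f"{seq:02d}"
```

For example, `lahman_id("Greinke", "Zack")` yields `"greinza01"`.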
Model Data Creation
Prior to modeling, we made a number of critical data transformations. These transformations can be broadly categorized as relating either to the individual free agent salary data, to individual measures from the Lahman database, to comparable players, or to payroll size.
Individual Free Agent Salary Data: First, to control for contract length, we calculated the average annual value (AAV) of each contract by dividing total contract size by the duration of the deal. (AAV is a crude, undiscounted dollar figure that treats all future cash flows as the same. Yet we believe it is appropriate to use in this context, given that we don’t have a lens into an appropriate discount rate.) Next, nominal AAV values were inflation adjusted using the Consumer Price Indices (CPI) from the Bureau of Labor Statistics (BLS); this ensured that all salary information was represented in 2017 real dollar terms. Third, non-pitcher and minor league deals were removed from the data in order to isolate Major League pitchers only. After these steps, 641 free agents from 2006-2017 remained in our sample.
Finally, we log-transformed the AAV variable because of the presence of a number of outlier AAV deals each year. The below pictures present the distribution of nominal AAV and log-transformed AAV by year.
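The salary transformations above can be sketched in a few lines of Python. The CPI figures below are approximate annual-average CPI-U values included for illustration only; actual values should be pulled from the BLS:

```python
import math

# Approximate annual-average CPI-U values (illustrative; pull actuals from BLS)
CPI = {2015: 237.0, 2017: 245.1}

def real_log_aav(total_value, years, sign_year, base_year=2017):
    """Average annual value, inflated to base-year dollars, then log-transformed."""
    aav = total_value / years                         # crude, undiscounted AAV
    real_aav = aav * CPI[base_year] / CPI[sign_year]  # express in 2017 real dollars
    return math.log(real_aav)
```

For example, a hypothetical $30M, two-year deal signed in 2015 maps to `log(15,000,000 × 245.1 / 237.0)`.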
Individual Measures from the Lahman Database: We created rate statistics from count statistics (using total batters faced as the denominator) in order to control for differences in aggregate output between free agents. We also calculated other metrics such as ERA, WHIP, and Fielding Independent Pitching (FIP). For each of these measures, we incorporated time series lags for each of the two years prior to free agency to account for a pitcher’s most recent performance heading into his free agent winter. In addition to these single-season lagged variables, we created cumulative counts for number of innings thrown in a pitcher’s career, total All-Star appearances, and cumulative wins. We added an indicator variable to separate relief pitchers from starters, and we calculated each pitcher’s age at the midpoint of the season immediately following their free agent winter. Finally, we added each player’s log-transformed salary in the previous season; if previous year salary was not available in the Lahman database, we used the last salary recorded for that player.
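Two of these derived measures can be sketched as follows, using the standard FIP formula; the league constant (roughly 3.1) varies slightly by season and is an assumption here:

```python
def rate_stats(so, bb, tbf):
    """Strikeout and walk rates per total batters faced."""
    return so / tbf, bb / tbf

def fip(hr, bb, hbp, so, ip, constant=3.10):
    """Fielding Independent Pitching: (13*HR + 3*(BB+HBP) - 2*SO) / IP + constant."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + constant
```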
Comparable Players: For each player-year observation in the database, we modeled the supply of comparable free agents in that year by calculating the Euclidean distance between a given free agent and every other free agent in the market. The analysis was done separately for starting pitchers and relieving pitchers. For starters, distance was calculated based on age, lagged wins, and lagged FIP; for relievers, distance was calculated based on age, lagged saves, and lagged FIP. The threshold distance for “comparable” pitchers was chosen based on a sensitivity analysis whereby we inspected the distribution of total comparables across years under varying threshold levels. The below picture presents the average comparables per player by year for both starters and relievers.
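The comparables count can be sketched as below. The feature tuples and threshold are illustrative assumptions, and in practice the inputs should be on comparable scales before distances are computed:

```python
import math

def n_comparables(player, market, threshold):
    """Count other free agents within a Euclidean-distance threshold of `player`.

    Each entry is a feature tuple, e.g. (age, lagged wins, lagged FIP) for
    starters or (age, lagged saves, lagged FIP) for relievers."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(1 for other in market
               if other is not player and dist(player, other) <= threshold)
```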
There are more relievers in the market generally, and clubs also view free agent relievers as more interchangeable than starters; thus, it is appropriate that relievers have more comparables than their starter counterparts on average.
Payroll Size: Payroll varies widely between clubs in Major League baseball. Thus, for each player-year, we added the opening day payroll size of the acquiring club for that year to reflect the differences in team spending habits.
Opening day payrolls were also inflation adjusted to represent 2017 real dollar terms.
Modeling Free Agent AAV
After gathering, cleaning, and transforming the data, we applied three continuous-outcome modeling techniques: linear regression, CART, and random forest. We trained our models using the free agent data from 2006-2014, and we held out the 2015-2017 data in order to test our models’ out-of-sample performance. Based on our data and modeling, we found that the linear regression and random forest models had comparable accuracy, though the linear regression model was clearly more interpretable. Each method is described in greater detail below:
As noted above, linear regression offered the greatest combination of accuracy and interpretability. We initially started with a data set of 35 potential independent variables; however, we noticed a very high degree of correlation among certain variables (e.g., Wins and Innings Pitched), so we excluded certain highly correlated variables to reduce multicollinearity. We then removed variables one by one based on statistical significance and intuition, resulting in the model included below.
As we hypothesized, recent opportunity-based performance measures (wins and saves) as well as factors that are more reflective of a pitcher’s individual talent (strikeout percentage and walk percentage) are both heavily weighted by the market. Interestingly, including lagged FIP, a pitcher’s weight, and/or a pitcher’s handedness in the model did not improve the performance. The cumulative measures (wins, All-Star appearances, total innings pitched) also were not helpful.
Our linear regression model performs fairly well out-of-sample, with an out-of-sample Root Mean Squared Error (RMSE) of 0.61 ($1.84M), an out-of-sample Mean Absolute Error (MAE) of 0.48 ($1.62M), and an out-of-sample R-squared (OSR-squared) of 0.61. The below picture presents a scatter plot of predicted versus actual contract values for the 191 players in our test set.
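These metrics can be computed as sketched below; note that OSR-squared benchmarks the model against a baseline that always predicts the training-set mean, which is the convention we assume here:

```python
import math

def out_of_sample_metrics(y_test, y_pred, y_train_mean):
    """Return (RMSE, MAE, OSR^2) on the held-out set.

    OSR^2 compares the model's squared errors against a baseline
    that always predicts the training-set mean."""
    errs = [p - a for p, a in zip(y_pred, y_test)]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    sse = sum(e * e for e in errs)
    sst = sum((a - y_train_mean) ** 2 for a in y_test)
    return rmse, mae, 1 - sse / sst
```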
As shown above, our model performs adequately with small to medium AAV players, but it tends to underpredict AAV for bigger money pitchers (especially starters). That our model systematically underestimates expensive starting pitchers means that we likely are not accurately capturing how teams financially differentiate between starters and relievers; for instance, teams deeply value a starter’s durability—which we have not explicitly included here—in addition to his performance metrics.
In the interest of delivering the most interpretable model possible, one that could serve as a simple guide for managerial decision-making, we next built a CART model. Given a set of user-defined parameters, a CART model builds a tree by splitting the data on certain independent variables. CART models are highly interpretable given that they provide simple rules to determine the prediction; the output is a decision tree-type picture like the below.
As expected, our CART output proved to be highly intuitive, with premiums paid for pitchers with more than 8.5 wins and high strikeout percentages. (Outcomes in the tree below are log AAV.)
After the initial split on last year’s wins, premiums were paid for closers, and large market teams also evinced the ability to pay higher premiums for free agents. Predictably, despite this highly intuitive result, the CART model fared poorly during validation on the test set, yielding an out-of-sample RMSE of 0.72 ($2.05M), an out-of-sample MAE of 0.59 ($1.80M), and an OSR-squared value of 0.45.
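As a sketch of the technique, a regression tree can be fit with scikit-learn in a few lines. The toy features, target values, and parameter settings below are illustrative assumptions, not our actual training data:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy training data: (lagged wins, lagged K%, lagged saves) -> log AAV
X_train = [[12, 0.25, 0], [4, 0.15, 20], [15, 0.28, 0], [2, 0.10, 35]]
y_train = [16.2, 14.1, 16.6, 14.8]

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=1, random_state=1)
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=["wins_lag1", "k_pct_lag1", "saves_lag1"]))
```

In practice, parameters such as the minimum leaf size and tree depth are chosen by cross-validation to limit overfitting.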
Since our goal was predictive accuracy, we elected to fit a random forest model to our data to assess whether or not we could increase our accuracy over a simple linear regression. At a high level, random forest models take advantage of the “wisdom of crowds”: the idea that the predictions of a group outperform those from any one person. A random forest model is a combination of many CART models; each CART model is trained with a random subset of observations and variables from the training data, and each fitted CART tree makes a prediction given the values of the independent variables in its subset. Because each CART model is trained on a different subset of observations and variables, each model uncovers slightly different patterns. When combined, a random forest can find complex patterns in the data that a simple linear model would miss. The one big drawback to random forest is the lack of interpretability; a random forest model is more or less a black box, albeit one that often delivers very accurate predictions.
We started with the same set of data above and conducted out-of-bag cross-validation on our training data to determine the appropriate MTRY value. (The MTRY value controls the number of variables examined at each split of the fitted CART trees.) Based on this analysis, we determined that an MTRY of 10 variables resulted in the lowest mean absolute error, so that was the value we used in our final random forest model.
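That tuning loop can be sketched with scikit-learn, where `max_features` plays the role of MTRY and out-of-bag predictions stand in for cross-validation; the synthetic data and grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                    # 12 candidate predictors
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=200)  # synthetic response

best_mtry, best_mae = None, float("inf")
for mtry in range(2, 13, 2):
    rf = RandomForestRegressor(n_estimators=300, max_features=mtry,
                               oob_score=True, random_state=0).fit(X, y)
    mae = float(np.mean(np.abs(rf.oob_prediction_ - y)))  # out-of-bag MAE
    if mae < best_mae:
        best_mtry, best_mae = mtry, mae
```

The `max_features` value with the lowest out-of-bag MAE is then used for the final fit.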
Similar to the linear regression model, the random forest model assigned high value to recent opportunity-based measures of performance, including last year’s salary, wins, innings pitched, and saves. The top eight measures are included in the table below:
Our random forest model was fitted with an out-of-sample RMSE of 0.62 ($1.86M), an out-of-sample MAE of 0.50 ($1.65M), and an OSR-squared of 0.60.
Comparison of Models
The below table depicts a consolidated comparison of the out-of-sample performance for each of the three models.
Interestingly, the linear regression delivered the highest OSR-squared and lowest error values of all three potential models. The CART model’s performance likely suffered on account of overfitting to the training data; this result persisted despite multiple adjustments to the split between the training and test sets. While intuition may have suggested that the random forest model would always outperform the other two options, our empirical observation suggests that the model struggled to improve on the linear model on account of the relatively small training data set. (Random forest would almost certainly have outperformed the linear model if we had millions of data points.)
Just for fun, let’s take a look at some of the best and worst predictions using the coefficients from the linear model. These players are from the test data set, so their data was not used in model estimation.
Interestingly, we were very close on two of the highest paid starters in the game: Zack Greinke and David Price. This is despite systematically underpredicting expensive starters, as can be seen in the fairly horrible predictions for Jeff Samardzija, Wei-Yin Chen, Johnny Cueto, and John Lackey.
Challenges in Modeling Free Agent Pitcher Salaries
Forecasting MLB player salaries is a challenging proposition that we believe is complicated by four main factors. First, team plans and market forces are unclear and difficult to model. As constructed, our model fails to account for team need, which means we fail to capture a team’s willingness to pay. Similarly, many teams plan to trade for players or target players next year instead of pursuing free agents this year, which impacts the competition for player resources. To refine the model, we would seek to better account for team needs and market characteristics, though the appropriate approach to take remains unclear.
Second, because teams pay for future performance, the model should incorporate performance projections rather than actual observed performance. Though this creates a “model of models” scenario, each team’s willingness-to-pay is based on its expectation of how a player will perform. Of course, historical performance indicators are baselines for projections, but they deviate from projections in that they do not include the vital regression-to-the-mean component. To better reflect what the market value is likely to be for a player, we would seek to incorporate some form of consensus projections.
Third, the market for pitchers is not an independent marketplace. Teams have limited resources to spend, and money spent on non-pitching staff clearly impacts the resources available to pay free agent pitchers. Our model currently fails to account for these measures, so we would seek to model their impact in a future forecast by incorporating data reflecting the size and competitiveness of the overall free agent market—not just the free agent market for pitchers.
Finally, the huge influx of cash from local TV deals has perhaps translated into salary inflation above and beyond the CPI that we used to inflation-adjust our salary data. Because this is a fairly recent development, there may be a structural break between the data we used to fit the model (years 2006-2014) and the data after that period. A sharper model might better account for baseball-specific inflation rather than using the CPI only.
Though the specified MLB pitcher salary model has room for improvement, it provides a reasonably accurate prediction of the average annual contract value an MLB pitcher is likely to command upon entering free agency. Using this prediction output—and more importantly, understanding the factors driving the market’s value in our model—teams can identify strong performers who are expected to command lower prices, most likely those who have lower opportunity-based performance measures like wins or saves. We would recommend that teams with limited budgets use our model’s cost projections alongside more traditional scouting analysis to focus their limited resources on potentially undervalued free agents who address team needs.