Predicting Home Runs with Python
Home runs are, most fans would agree, the most exciting part of baseball. An interesting question is whether we can predict how many home runs a player has hit (or will hit) based on the numbers they are putting up offensively. Thanks to the power of Python and machine learning, we can approximate a player's home run total and gain insight into which features contribute most to that number.
Load in Data
First, we need to load the data into a Pandas DataFrame.
import os
import pandas as pd

# Read the batting stats CSV (here, the third file in the working directory)
df = pd.read_csv(os.listdir()[2])
df.head()
Year | Name | Position | Team | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | AVG | OBP | SLG | OPS |
2022 | Aaron Judge | CF | NYY | 157 | 570 | 133 | 177 | 28 | 0 | 62 | 131 | 111 | 175 | 16 | 3 | 0.311 | 0.425 | 0.686 | 1.111 |
2022 | Yordan Alvarez | DH | HOU | 135 | 470 | 95 | 144 | 29 | 2 | 37 | 97 | 78 | 106 | 1 | 1 | 0.306 | 0.406 | 0.613 | 1.019 |
2022 | Paul Goldschmidt | 1B | STL | 151 | 561 | 106 | 178 | 41 | 0 | 35 | 115 | 79 | 141 | 7 | 0 | 0.317 | 0.404 | 0.578 | 0.982 |
2022 | Jose Altuve | 2B | HOU | 141 | 527 | 103 | 158 | 39 | 0 | 28 | 57 | 66 | 87 | 18 | 1 | 0.300 | 0.387 | 0.533 | 0.920 |
2022 | Freddie Freeman | 1B | LAD | 159 | 612 | 117 | 199 | 47 | 2 | 21 | 100 | 84 | 102 | 13 | 3 | 0.325 | 0.407 | 0.511 | 0.918 |
Feature Names
CF: Center Field
RF: Right Field
1B: First Base
2B: Second Base
3B: Third Base
C: Catcher
SS: Short Stop
DH: Designated Hitter
G
Games Played: Games in which a player has appeared.
AB
At Bats: Trips to the plate that do not result in a walk, hit by pitch, sacrifice or reach by interference.
R
Runs: When a baserunner successfully reaches home plate, scoring a run for his team.
H
Hits: When a batter reaches base safely on a fair ball, unless the official scorer deems that he reached on an error or a fielder's choice.
2B
Double: When the batter hits the ball and reaches second base safely.
3B
Triple: When the batter hits the ball and reaches third base safely.
HR
Home Run: When a hitter hits the ball over the outfield fence in fair territory. The batter must touch every base, including home plate.
RBI
Runs Batted In: Runs that score because of the batter's safe hit, sac bunt, sac fly, infield out, or fielder's choice, or because a run is forced home by a bases-loaded walk, hit batter, or interference.
BB
Walks: When a player is awarded first base after four balls or an intentional walk. Players hit by pitches are not included in this.
SO
Strikeout: When a batter is put out after accumulating three strikes.
SB
Stolen Base: When a baserunner advances to the next base while the batter is still at the plate.
CS
Caught Stealing: When a baserunner attempts to steal a base and is tagged out.
AVG
Batting Average: The rate at which a batter records a hit, measured as hits divided by at bats.
OBP
On-Base Percentage: The rate at which a batter reaches base in his plate appearances.
SLG
Slugging Percentage: The rate of total bases per at bat.
OPS
On-Base Plus Slugging: The sum of on-base percentage and slugging percentage.
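Before moving on, it can help to see how these rate stats relate to the counting stats. Below is a quick sanity check using the Aaron Judge row from df.head() above; note that the official OBP formula also uses hit-by-pitches and sacrifice flies, which are not columns in this data set, so that value is only approximate.
# Recompute the rate stats for the first row (Aaron Judge, 2022) from the counting stats
judge = df.iloc[0]
singles = judge['H'] - judge['2B'] - judge['3B'] - judge['HR']
total_bases = singles + 2 * judge['2B'] + 3 * judge['3B'] + 4 * judge['HR']
avg = judge['H'] / judge['AB']                                          # 177 / 570 ≈ 0.311
slg = total_bases / judge['AB']                                         # 391 / 570 ≈ 0.686
obp_approx = (judge['H'] + judge['BB']) / (judge['AB'] + judge['BB'])   # ≈ 0.423 (no HBP/SF columns, so approximate)
ops = judge['OBP'] + judge['SLG']                                       # 0.425 + 0.686 = 1.111
print(avg, obp_approx, slg, ops)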
Data Exploration
Let's get a basic feel for the data and see if we can uncover any interesting things before we get to modeling.
import matplotlib.pyplot as plt
from matplotlib import colors
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 8), tight_layout=True)
ax1.hist(df['AVG'], color='navy')
ax1.set_ylabel('Frequency')
ax1.set_xlabel('AVG')
ax1.set_title('Average Hitting')
ax2.hist(df['OPS'], color='orange')
ax2.set_xlabel('On-Base Plus Slugging')
ax2.set_ylabel('Frequency')
ax2.set_title('OPS')
violin = ax3.violinplot(df['HR'], showmeans=True, showmedians=True)
for ii in violin['bodies']:
    ii.set_color('black')
    ii.set_edgecolor('indigo')
    ii.set_alpha(1)
ax3.set_title('Home Runs')
N, bins, patches = ax4.hist(df['OBP'])
fracs = N / N.max()
norm = colors.Normalize(fracs.min(), fracs.max())
for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.winter(norm(thisfrac))
    thispatch.set_facecolor(color)
ax4.set_xlabel('On Base Percentage')
ax4.set_ylabel('Frequency')
ax4.set_title('OBP')
plt.show()
All of these statistics are roughly normally distributed, which will help us make sense of how they factor into predicting home runs.
We can see that most players:
Are batting an average between .225 and .300
Have an OPS (On-Base-Plus-Slugging) between 0.7 and 0.9
Have an OBP (On-Base-Percentage) between 0.30 and 0.40
Hit around 18 home runs a season
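These eyeball estimates can be double-checked with a quick numeric summary; a minimal sketch using pandas:
# Summary statistics (mean, quartiles, etc.) for the stats plotted above
df[['AVG', 'OPS', 'OBP', 'HR']].describe()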
Does player position factor into the equation?
fig, (ax1, ax2) = plt.subplots(1, 2, tight_layout=True)
ax1.barh(df['Position'], df['R'])   # the column is named 'Position' in the data loaded above
ax1.set_xlabel('Runs')
ax1.set_ylabel('Player Position')
ax2.barh(df['Position'], df['RBI'])
ax2.set_xlabel('RBI')
ax2.set_ylabel('Player Position')
plt.show()
Feature Engineering
Let's see what other useful statistics can be found by taking information from our columns and making new columns.
For example, it might be useful to know how often a player scores a run relative to the number of hits they have. We can assume that the better a player's run-to-hit ratio, the better their hitting accuracy and power might be. This could be significant in determining home runs, although we won't know how much until we construct a correlation matrix.
run_to_hit_ratio = df.R / df.H
run_to_hit_ratio
0 0.751412
1 0.659722
2 0.595506
3 0.651899
4 0.587940
...
534 0.575472
535 0.508772
536 0.472000
537 0.603448
538 0.463636
Another potentially useful feature could be a player's strikeout rate. That is, what percentage of the time does a player strike out when at the plate?
strikeout_rate = df['SO'] / df['AB']
strikeout_rate
0 0.307018
1 0.225532
2 0.251337
3 0.165085
4 0.166667
...
534 0.303534
535 0.234000
536 0.235887
537 0.276471
538 0.220648
We might infer that the more a player is striking out, the more they are swinging the bat, which could be a factor in determining home runs.
Let's take each player's batting average, on-base percentage, slugging percentage, and on-base-plus-slugging percentage and multiply these together to get what we'll consider an overall "offensive score".
df['AVG * OBP * SLG * OPS'] = df['AVG'] * df['OBP'] * df['SLG'] * df['OPS']
Off Score
0.100737
0.077604
0.072691
0.056931
0.062050
Correlation Matrix
Let's make a correlation matrix with Seaborn to get a better visual representation of the relationship between features.
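One thing to note: the code below computes correlations on a DataFrame called df_temp, which is not constructed anywhere in this excerpt. A minimal sketch of one way to build it, assuming the two remaining engineered features are appended as columns (the column names here are placeholders) and the text columns are integer-encoded so that .corr() and the models can consume them:
# One possible construction of df_temp (an assumption; the original notebook may differ).
# df already contains the offensive-score column added above; here we attach the other
# two engineered features and integer-encode the text columns.
df_temp = df.copy()
df_temp['R/H'] = run_to_hit_ratio        # placeholder column name
df_temp['SO Rate'] = strikeout_rate      # placeholder column name
for col in ['Name', 'Position', 'Team']:
    df_temp[col] = df_temp[col].astype('category').cat.codes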
import seaborn as sns
corr_matrix = df_temp.corr()
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix, fmt='.2f', cmap='gist_heat', annot=True, linewidths=0.5)
We can see from this matrix that:
Shades of red indicate little to no correlation between the variables
Shades of orange/yellow indicate a moderate positive correlation
Shades of peach/white indicate a strong positive correlation
Shades of dark red/black indicate a negative correlation
When we look at the HR row, we can see that most of the variables correlate with home runs. The only features that do not are player name, player position, and team name, which makes sense: they are incidental labels with no real bearing on what we're trying to predict.
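To read those relationships as numbers rather than colors, we can sort the HR column of the matrix; a small sketch:
# Correlations with HR, strongest positive first
corr_matrix['HR'].sort_values(ascending=False)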
Build a Machine Learning Model
Now we're ready to start making predictions. First, we need to split the data into X and y (features and target variable).
# Split data into x and y
X = df_temp.drop('HR', axis=1)
y = df_temp['HR']
Then we split the data into training and test sets using scikit-learn.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.20,
random_state=101)
Make sure the sets are in the right shape
X_train.shape, y_train.shape
((431, 22), (431,))
X_test.shape, y_test.shape
((108, 22), (108,))
We split the data into 80% training and 20% testing, which should be a good split for this scenario. Generally, we want to have 20-25% of the data go to the test set.
Let's make a function to train a variety of models and see which performs best. Typically, ensemble estimators give the best results; these include CatBoost.
The problem we are dealing with is regression (estimating a number), so we need to make sure we use the appropriate estimators, as each of these models also has a classification counterpart that works similarly.
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from catboost import CatBoostRegressor
import numpy as np

np.random.seed(666)
models = {'AdaBoost': AdaBoostRegressor(),
'KNN': KNeighborsRegressor(),
'Support Vector Machine': LinearSVR(),
'ExtraTrees': ExtraTreesRegressor(),
'Random Forest': RandomForestRegressor(),
'GradientBoosting': GradientBoostingRegressor(),
'CatBoost': CatBoostRegressor(verbose=0)}
def train_models(models: dict, X_train, X_test, y_train, y_test):
    model_train_scores = {}
    model_test_scores = {}
    for model_name, model in models.items():
        print(f'Fitting model {model_name}')
        model.fit(X_train, y_train)
        model_train_scores[model_name] = model.score(X_train, y_train)
        model_test_scores[model_name] = model.score(X_test, y_test)
    return model_train_scores, model_test_scores
train_models(models, X_train, X_test, y_train, y_test)
Training results
({'AdaBoost': 0.9286631713397986,
'KNN': 0.8037141965782321,
'Support Vector Machine': 0.8724454407508145,
'ExtraTrees': 1.0,
'Random Forest': 0.9881160143302705,
'GradientBoosting': 0.9902382319119076,
'CatBoost': 0.9997550603425505}
Test results
{'AdaBoost': 0.8599337431816676,
'KNN': 0.7002678885753402,
'Support Vector Machine': 0.8933242774809239,
'ExtraTrees': 0.9169202022854908,
'Random Forest': 0.9060005342125115,
'GradientBoosting': 0.9365180308137456,
'CatBoost': 0.9284641736813508})
The test results are always the most important metric to look at because the best models perform well on data they have never seen before.
The best-performing model on the test data is the Gradient Boosting estimator. Let's evaluate its metrics further and see if we can improve the model via hyperparameter tuning.
from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error, mean_absolute_error

def model_metrics(model):
    model_preds = model.predict(X_test)
    true_preds = y_test
    scores = {'R2': r2_score(y_true=true_preds,
                             y_pred=model_preds),
              'Mean Squared Error': mean_squared_error(y_true=y_test,
                                                       y_pred=model_preds),
              'Mean Absolute Error': mean_absolute_error(y_true=y_test,
                                                         y_pred=model_preds),
              'Mean Squared Log Error': mean_squared_log_error(y_true=y_test,
                                                               y_pred=model_preds)}
    return scores
models = {'Random Forest': RandomForestRegressor(),
'Gradient Boosting': GradientBoostingRegressor(),
'catboost': CatBoostRegressor(verbose=0)}
for model_name, model in models.items():
    print(f'Fitting {model_name}')
    model.fit(X_train, y_train)
    print(f'Scoring {model_name}')
    print(model_metrics(model=model))
Fitting Random Forest
Scoring Random Forest
{'R2': 0.9100736945042049, 'Mean Squared Error': 12.417254629629626, 'Mean Absolute Error': 2.5387962962962956, 'Mean Squared Log Error': 0.11507279609462101}
Fitting Gradient Boosting
Scoring Gradient Boosting
{'R2': 0.937831934651002, 'Mean Squared Error': 8.584325721087914, 'Mean Absolute Error': 2.17413905571074, 'Mean Squared Log Error': 0.08356007553293097}
Fitting catboost
Scoring catboost
{'R2': 0.9284641736813508, 'Mean Squared Error': 9.877850153436622, 'Mean Absolute Error': 2.057505305063898, 'Mean Squared Log Error': 0.095040599925313}
We can see that GradientBoostingRegressor() outperforms the other models in some of the major regression metrics. The default metric for regression models is R2 (coefficient of determination). If every prediction a model makes is the mean of the target (home runs), then it will have an R2 score of 0.0. An R2 score of 1.0 is a perfect score, so our base model is doing pretty well with default parameters, achieving an R2 score of around 0.937.
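As a quick illustration of what R2 measures (a sketch, not part of the original workflow), we can compute it by hand: it compares the model's squared errors to those of a baseline that always predicts the mean of the test targets.
# R2 = 1 - (sum of squared residuals) / (sum of squared deviations from the mean)
preds = models['Gradient Boosting'].predict(X_test)
ss_res = ((y_test - preds) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot
print(r2_manual)   # matches the R2 of about 0.938 reported above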
Hyperparameter Tuning
We can see the adjustable parameters of our model by calling the get_params method. More info on gradient boosting can be found here.
GradientBoostingRegressor().get_params()
{'alpha': 0.9,
'ccp_alpha': 0.0,
'criterion': 'friedman_mse',
'init': None,
'learning_rate': 0.1,
'loss': 'squared_error',
'max_depth': 3,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_iter_no_change': None,
'random_state': None,
'subsample': 1.0,
'tol': 0.0001,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}
Let's tune our model with RandomizedSearchCV.
This method samples parameter combinations from a grid and scores each one with cross-validation, which splits the training data into several folds so that every candidate is evaluated on data it was not fit on. This keeps the model from "cheating", so to speak, by tuning itself to one particular split of the data.
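For intuition, here is what plain 5-fold cross-validation looks like on its own, before we fold it into the parameter search; a minimal sketch:
from sklearn.model_selection import cross_val_score

# Score a default Gradient Boosting model on 5 different train/validation splits of the training data
cv_scores = cross_val_score(GradientBoostingRegressor(), X_train, y_train, cv=5)
print(cv_scores, cv_scores.mean())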
from sklearn.model_selection import RandomizedSearchCV
param_grid = {'n_estimators': np.arange(100, 1000, 50),
'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'],
'criterion': ['friedman_mse', 'squared_error'],
'learning_rate': [0.1, 0.01, 0.05],
'min_samples_leaf': np.arange(1, 12, 1),
'max_depth': np.arange(3, 11, 1)}
grad_boost_model = GradientBoostingRegressor()

grad_boost_model_tuned = RandomizedSearchCV(estimator=grad_boost_model,
                                            param_distributions=param_grid,
                                            n_iter=25,
                                            cv=5,
                                            verbose=True)

# 25 parameter settings x 5-fold cross-validation = 125 model fits
grad_boost_model_tuned.fit(X_train, y_train)

grad_boost_model_tuned.best_params_
{'n_estimators': 500,
'min_samples_leaf': 10,
'max_depth': 4,
'loss': 'squared_error',
'learning_rate': 0.05,
'criterion': 'squared_error'}
grad_boost_model_tuned.score(X_test, y_test)
0.9498886589964061
model_metrics(model=grad_boost_model_tuned)
{'R2': 0.9498886589964061,
'Mean Squared Error': 6.91950233742141,
'Mean Absolute Error': 1.8524247042392998,
'Mean Squared Log Error': 0.08025189012399532}
We managed to improve our model's metrics. Let's make a prediction on real-world data using Shohei Ohtani.
shohei_ohtani = np.array([2023.0, 1.0, 6.0, 15.0, 132.0, 489.0, 101.0, 150.0, 25.0, 8.0,
                          95.0, 85.0, 141.0, 19.0, 6.0, 0.307, 0.410, 0.661, 1.071,
                          0.6733333333333333, 0.2883435582822086, 0.08910727496999998])
# Make sure the data is in the same shape as what the model was trained on
shohei_ohtani.reshape(1, -1)
array([[2.02300000e+03, 1.00000000e+00, 6.00000000e+00, 1.50000000e+01,
1.32000000e+02, 4.89000000e+02, 1.01000000e+02, 1.50000000e+02,
2.50000000e+01, 8.00000000e+00, 9.50000000e+01, 8.50000000e+01,
1.41000000e+02, 1.90000000e+01, 6.00000000e+00, 3.07000000e-01,
4.10000000e-01, 6.61000000e-01, 1.07100000e+00, 6.73333333e-01,
2.88343558e-01, 8.91072750e-02]])
# Predict how many home runs Ohtani has
grad_boost_model_tuned.predict(shohei_ohtani.reshape(1, -1))
array([41.5397646])
Rounding up, our model predicts around 42 home runs for Ohtani. At the time this model was built, Ohtani had 44 home runs.
Feature Importances
Lastly, we'll look at how the model ranked the importance of the features it was given, which should echo what we saw in the correlation matrix.
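The plotting code itself is not shown here, but a minimal sketch of how those importances could be pulled out of the tuned search object and visualized (assuming the feature order of X_train):
# Feature importances from the best estimator found by the randomized search
best_model = grad_boost_model_tuned.best_estimator_
importances = pd.Series(best_model.feature_importances_, index=X_train.columns).sort_values()
importances.plot(kind='barh', figsize=(10, 8), color='navy')
plt.xlabel('Importance')
plt.title('Gradient Boosting Feature Importances')
plt.show()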