Predicting Home Runs with Python

Home runs are, most fans would agree, the most exciting part of baseball. An interesting question follows: can we predict how many home runs a player will hit based on the other numbers they are putting up offensively? Thanks to the power of Python and machine learning, we can approximate a player's home run total and gain insight into which features contribute most to that number.

Notebook

Data Source

Load in Data

First, we need to load the data into a Pandas DataFrame.

import os
import pandas as pd

# The CSV file happens to be the third entry in the working directory listing
df = pd.read_csv(os.listdir()[2])

df.head()
Year  Name              Position  Team  G    AB   R    H    2B  3B  HR  RBI  BB   SO   SB  CS  AVG    OBP    SLG    OPS
2022  Aaron Judge       CF        NYY   157  570  133  177  28  0   62  131  111  175  16  3   0.311  0.425  0.686  1.111
2022  Yordan Alvarez    DH        HOU   135  470  95   144  29  2   37  97   78   106  1   1   0.306  0.406  0.613  1.019
2022  Paul Goldschmidt  1B        STL   151  561  106  178  41  0   35  115  79   141  7   0   0.317  0.404  0.578  0.982
2022  Jose Altuve       2B        HOU   141  527  103  158  39  0   28  57   66   87   18  1   0.300  0.387  0.533  0.920
2022  Freddie Freeman   1B        LAD   159  612  117  199  47  2   21  100  84   102  13  3   0.325  0.407  0.511  0.918

Feature Names

  • CF: Center Field

  • RF: Right Field

  • 1B: First Base

  • 2B: Second Base

  • 3B: Third Base

  • C: Catcher

  • SS: Short Stop

  • DH: Designated Hitter

G

Games Played: Games in which a player has appeared.

AB

At Bats: Trips to the plate that do not result in a walk, hit by pitch, sacrifice or reach by interference.

R

Runs: Credited when a baserunner safely reaches home plate, scoring a run for their team.

H

Hits: When a batter reaches base safely on a fair ball, unless the official scorer rules that they reached on an error or a fielder's choice.

2B

Double: When the batter hits the ball and safely reaches second base, stopping there.

3B

Triple: When the batter hits the ball and safely reaches third base, stopping there.

HR

Home Run: When the batter hits the ball over the outfield fence in fair territory. The batter then rounds the bases and must touch every base, including home plate.

RBI

Runs Batted In: Runs that score as a result of the batter's safe hit, sac bunt, sac fly, infield out, or fielder's choice, or when the batter forces in a run via a bases-loaded walk, hit batter, or interference.

BB

Walks: When a batter is awarded first base after taking four balls, including intentional walks. Batters hit by pitches are not counted as walks.

SO

Strikeout: When a batter accumulates three strikes and is called out.

SB

Stolen Base: When a baserunner advances to the next base while the batter is still at the plate.

CS

Caught Stealing: When a baserunner attempts to steal a base and is tagged out.

AVG

Batting Average: Hits divided by at-bats; the rate at which a batter records a hit per official at-bat.

OBP

On-Base Percentage: The rate at which a batter reaches base in his plate appearances.

SLG

Slugging Percentage: The rate of total bases per at-bat.

OPS

On-Base Plus Slugging: The sum of on-base percentage and slugging percentage.
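
To make the rate stats concrete, here is a quick sketch that recomputes AVG, SLG, and OPS from Aaron Judge's 2022 counting stats in the preview above. OBP is only approximated, since hit-by-pitches and sacrifice flies are not in this dataset.

# Aaron Judge's 2022 counting stats from the df.head() preview above
ab, h, doubles, triples, hr, bb = 570, 177, 28, 0, 62, 111

singles = h - doubles - triples - hr
total_bases = singles + 2 * doubles + 3 * triples + 4 * hr

avg = h / ab                       # 0.311
slg = total_bases / ab             # 0.686
obp_approx = (h + bb) / (ab + bb)  # ~0.423 (true OBP, 0.425, also counts HBP and sac flies)
ops_approx = obp_approx + slg      # ~1.109 (listed OPS is 1.111)

print(round(avg, 3), round(slg, 3), round(obp_approx, 3), round(ops_approx, 3))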

Data Exploration

Let's get a basic feel for the data and see if we can uncover any interesting things before we get to modeling.

import matplotlib.pyplot as plt 
from matplotlib import colors

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 8), tight_layout=True)

ax1.hist(df['AVG'], color='navy')
ax1.set_ylabel('Frequency')
ax1.set_xlabel('AVG')
ax1.set_title('Average Hitting')

ax2.hist(df['OPS'], color='orange')
ax2.set_xlabel('On-Base Plus Slugging')
ax2.set_ylabel('Frequency')
ax2.set_title('OPS')

violin = ax3.violinplot(df['HR'], showmeans=True, showmedians=True)

for ii in violin['bodies']:
    ii.set_color('black')
    ii.set_edgecolor('indigo')
    ii.set_alpha(1)

ax3.set_title('Home Runs')

N, bins, patches = ax4.hist(df['OBP'])
fracs = N / N.max()
norm = colors.Normalize(fracs.min(), fracs.max())

for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.winter(norm(thisfrac))
    thispatch.set_facecolor(color)

ax4.set_xlabel('On Base Percentage')
ax4.set_ylabel('Frequency')
ax4.set_title('OBP')

plt.show()

All of these distributions are roughly normal, which will help us make sense of how these numbers factor into predicting home runs.
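
A quick numeric summary backs up what the histograms show (a minimal sketch, assuming the same column names as above):

# Summary statistics for the plotted columns
df[['AVG', 'OPS', 'OBP', 'HR']].describe()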

We can see that most players:

  • Have a batting average between .225 and .300

  • Have an OPS (On-Base-Plus-Slugging) between 0.7 and 0.9

  • Have an OBP (On-Base-Percentage) between 0.30 and 0.40

  • Get around 18 home runs a season

Does player position factor into the equation?

fig, (ax1, ax2) = plt.subplots(1, 2, tight_layout=True)

ax1.barh(df['Player Position'], df['R'])
ax1.set_xlabel('Runs')
ax1.set_ylabel('Player Position')

ax2.barh(df['Player Position'], df['RBI'])
ax2.set_xlabel('RBI')
ax2.set_ylabel('Player Position')

plt.show()

Feature Engineering

Let's see what other useful statistics can be found by taking information from our columns and making new columns.

For example, it might be practical to know how often a player scores a run relative to the number of hits they have. We can assume that the better a player's run-to-hit ratio, the better their hitting accuracy and power might be. This could be significant in determining home runs, although we won't know how much until we construct a correlation matrix.

run_to_hit_ratio = df.R / df.H

run_to_hit_ratio

0      0.751412
1      0.659722
2      0.595506
3      0.651899
4      0.587940
         ...   
534    0.575472
535    0.508772
536    0.472000
537    0.603448
538    0.463636

Another potentially useful feature could be a player's strikeout rate: what fraction of their at-bats end in a strikeout?

strikeout_rate = df['SO'] / df['AB']

strikeout_rate

0      0.307018
1      0.225532
2      0.251337
3      0.165085
4      0.166667
         ...   
534    0.303534
535    0.234000
536    0.235887
537    0.276471
538    0.220648

We might infer that the more a player is striking out, the more they are swinging the bat, which could be a factor in determining home runs.

Let's take each player's batting average, on-base percentage, slugging percentage, and on-base-plus-slugging percentage and multiply these together to get what we'll consider an overall "offensive score".

df['AVG * OBP * SLG * OPS'] = df['AVG'] * df['OBP'] * df['SLG'] * df['OPS']
Off Score
0.100737
0.077604
0.072691
0.056931
0.062050
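
One note: the correlation matrix and models below use a DataFrame called df_temp that isn't shown being assembled in this excerpt. A plausible sketch, assuming the engineered features are attached as columns and the text columns from the preview (Name, Position, Team) are encoded as numeric category codes so they can be correlated and modeled:

# Assumed assembly of df_temp (not shown in the original notebook)
df_temp = df.copy()

# Attach the engineered features; the offensive score column is already on df
df_temp['Run-to-Hit Ratio'] = run_to_hit_ratio
df_temp['Strikeout Rate'] = strikeout_rate

# Encode the string columns as integer category codes
for col in ['Name', 'Position', 'Team']:
    df_temp[col] = df_temp[col].astype('category').cat.codes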

Correlation Matrix

Let's make a correlation matrix with Seaborn to get a better visual representation of the relationship between features.

import seaborn as sns

corr_matrix = df_temp.corr()

fig, ax = plt.subplots(figsize=(15, 10))

ax = sns.heatmap(corr_matrix, fmt='.2f', cmap='gist_heat', annot=True, linewidths=0.5)

We can see from this matrix that:

  • Shades of red indicate little to no correlation between the variables

  • Shades of orange/yellow indicate a moderate positive correlation

  • Shades of peach/white indicate a strong positive correlation

  • Shades of dark red/black indicate a negative correlation

When we look at the HR row, we can see that most of the variables correlate with home runs. The only features that show little to no correlation with home runs are player name, player position, and team name. This makes sense, as these identifiers should have little direct bearing on what we're trying to predict.

Build a Machine Learning Model

Now we're ready to start making predictions. First, we need to split the data into X and y (features and target variable).

# Split data into x and y 
X = df_temp.drop('HR', axis=1)
y = df_temp['HR']

Then we split the data into training and test sets using scikit-learn.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=101)

Let's make sure the sets are the shapes we expect:

X_train.shape, y_train.shape

((431, 22), (431,))
X_test.shape, y_test.shape

((108, 22), (108,))

We split the data into 80% training and 20% testing, which should be a good split for this scenario. Generally, we want to have 20-25% of the data go to the test set.

Let's make a function to train a variety of models and see which performs the best. Typically, ensemble estimators give us the best results; these include CatBoost.

The problem we are dealing with is regression (estimating a continuous number), so we need to make sure we use the regressor version of each estimator; all of these models also have classifier counterparts that work similarly.

import numpy as np
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from catboost import CatBoostRegressor

np.random.seed(666)

models = {'AdaBoost': AdaBoostRegressor(),
          'KNN': KNeighborsRegressor(),
          'Support Vector Machine': LinearSVR(),
          'ExtraTrees': ExtraTreesRegressor(),
          'Random Forest': RandomForestRegressor(),
          'GradientBoosting': GradientBoostingRegressor(),
          'CatBoost': CatBoostRegressor(verbose=0)}


def train_models(models: dict, X_train, X_test, y_train, y_test):

    model_train_scores = {}
    model_test_scores = {}

    for model_name, model in models.items():
        print(f'Fitting model {model_name}')
        model.fit(X_train, y_train)

        model_train_scores[model_name] = model.score(X_train, y_train)
        model_test_scores[model_name] = model.score(X_test, y_test)

    return model_train_scores, model_test_scores


train_models(models, X_train, X_test, y_train, y_test)

Training results

({'AdaBoost': 0.9286631713397986,
  'KNN': 0.8037141965782321,
  'Support Vector Machine': 0.8724454407508145,
  'ExtraTrees': 1.0,
  'Random Forest': 0.9881160143302705,
  'GradientBoosting': 0.9902382319119076,
  'CatBoost': 0.9997550603425505}

Test results

{'AdaBoost': 0.8599337431816676,
  'KNN': 0.7002678885753402,
  'Support Vector Machine': 0.8933242774809239,
  'ExtraTrees': 0.9169202022854908,
  'Random Forest': 0.9060005342125115,
  'GradientBoosting': 0.9365180308137456,
  'CatBoost': 0.9284641736813508})

The test results are the most important numbers to look at, because a good model is one that performs well on data it has never seen before.

The best-performing model on the test data is the Gradient Boosting estimator. Let's evaluate its metrics further and see if we can improve the model via hyperparameter tuning.

from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error, mean_absolute_error

def model_metrics(model):

    model_preds = model.predict(X_test)
    true_preds = y_test

    scores = {'R2': r2_score(y_true=true_preds,
                             y_pred=model_preds),
              'Mean Squared Error': mean_squared_error(y_true=y_test,
                                                       y_pred=model_preds),
              'Mean Absolute Error': mean_absolute_error(y_true=y_test,
                                                         y_pred=model_preds),
              'Mean Squared Log Error': mean_squared_log_error(y_true=y_test,
                                                               y_pred=model_preds)}

    return scores

models = {'Random Forest': RandomForestRegressor(),
          'Gradient Boosting': GradientBoostingRegressor(),
          'catboost': CatBoostRegressor(verbose=0)}

for model_name, model in models.items():
    print(f'Fitting {model_name}')
    model.fit(X_train, y_train)
    print(f'Scoring {model_name}')
    print(model_metrics(model=model))

Fitting Random Forest
Scoring Random Forest
{'R2': 0.9100736945042049, 'Mean Squared Error': 12.417254629629626, 'Mean Absolute Error': 2.5387962962962956, 'Mean Squared Log Error': 0.11507279609462101}
Fitting Gradient Boosting
Scoring Gradient Boosting
{'R2': 0.937831934651002, 'Mean Squared Error': 8.584325721087914, 'Mean Absolute Error': 2.17413905571074, 'Mean Squared Log Error': 0.08356007553293097}
Fitting catboost
Scoring catboost
{'R2': 0.9284641736813508, 'Mean Squared Error': 9.877850153436622, 'Mean Absolute Error': 2.057505305063898, 'Mean Squared Log Error': 0.095040599925313}

We can see that GradientBoostingRegressor() outperforms the other models on most of the major regression metrics. The default metric for regression models in scikit-learn is R2 (coefficient of determination): a model that always predicts the mean of the target (home runs) scores 0.0, while perfect predictions score 1.0. So our base model is doing pretty well with default parameters, achieving an R2 score of around 0.937.
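
To make that concrete, here is a tiny sketch (with made-up numbers, not our data) showing the two ends of the R2 scale:

from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Always predicting the mean of the target gives an R2 of 0.0
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0

# Predicting every value exactly gives the perfect score of 1.0
print(r2_score(y_true, y_true))  # 1.0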

Hyperparameter Tuning

We can see the adjustable parameters of our model by calling its get_params() method.

More info on gradient boosting can be found in the scikit-learn documentation for GradientBoostingRegressor.

GradientBoostingRegressor().get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'squared_error',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

Let's tune our model with RandomizedSearchCV. This method samples random combinations of hyperparameters from a grid and evaluates each one with cross-validation, which scores the candidate on several different train/validation splits of the training data so that the chosen parameters aren't just tuned to one lucky split.

from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': np.arange(100, 1000, 50),
              'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'],
              'criterion': ['friedman_mse', 'squared_error'],
              'learning_rate': [0.1, 0.01, 0.05],
              'min_samples_leaf': np.arange(1, 12, 1),
              'max_depth': np.arange(3, 11, 1)}

grad_boost_model = GradientBoostingRegressor()

grad_boost_model_tuned = RandomizedSearchCV(estimator=grad_boost_model,
                                            param_distributions=param_grid,
                                            n_iter=25,
                                            cv=5,
                                            verbose=True)

# 25 candidate parameter settings x 5 folds of cross-validation = 125 model fits
grad_boost_model_tuned.fit(X_train, y_train)

grad_boost_model_tuned.best_params_

{'n_estimators': 500,
 'min_samples_leaf': 10,
 'max_depth': 4,
 'loss': 'squared_error',
 'learning_rate': 0.05,
 'criterion': 'squared_error'}

grad_boost_model_tuned.score(X_test, y_test)

0.9498886589964061

model_metrics(model=grad_boost_model_tuned)

{'R2': 0.9498886589964061,
 'Mean Squared Error': 6.91950233742141,
 'Mean Absolute Error': 1.8524247042392998,
 'Mean Squared Log Error': 0.08025189012399532}

We managed to improve our model's metrics. Let's make a prediction on real-world data using Shohei Ohtani's 2023 stat line.

shohei_ohtani = np.array([2023.0, 1.0, 6.0, 15.0, 132.0, 489.0, 101.0, 150.0, 25.0, 8.0, 95.0, 85.0,
                          141.0, 19.0, 6.0, 0.307, 0.410, 0.661, 1.071, 0.6733333333333333,
                          0.2883435582822086, 0.08910727496999998])

# Make sure the data is in the same shape as what the model was trained on
shohei_ohtani.reshape(1, -1)

array([[2.02300000e+03, 1.00000000e+00, 6.00000000e+00, 1.50000000e+01,
        1.32000000e+02, 4.89000000e+02, 1.01000000e+02, 1.50000000e+02,
        2.50000000e+01, 8.00000000e+00, 9.50000000e+01, 8.50000000e+01,
        1.41000000e+02, 1.90000000e+01, 6.00000000e+00, 3.07000000e-01,
        4.10000000e-01, 6.61000000e-01, 1.07100000e+00, 6.73333333e-01,
        2.88343558e-01, 8.91072750e-02]])

# Predict how many home runs Ohtani has
grad_boost_model_tuned.predict(shohei_ohtani.reshape(1, -1))

array([41.5397646])

Rounding up, our model predicts around 42 home runs for Ohtani. At the time this model was written, Ohtani had hit 44 home runs.

Feature Importances

Lastly, we'll look at how the model ranks the importance of the features it was given, which should echo what we saw in the correlation matrix; a sketch of how this might be plotted is below.
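
A minimal sketch of pulling the importances out of the tuned model and plotting them, assuming the fitted RandomizedSearchCV object and the feature columns in X from above:

# Feature importances from the best estimator found by the randomized search
best_model = grad_boost_model_tuned.best_estimator_

importances = pd.Series(best_model.feature_importances_, index=X.columns)
importances.sort_values().plot.barh(figsize=(10, 8))

plt.xlabel('Importance')
plt.title('GradientBoostingRegressor Feature Importances')
plt.show()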