The singular goal in baseball is to score more runs than the opposing team, but that goal can be reached in many different ways. Like many other sports, baseball relies heavily on statistics to determine which strategies are effective and which are not. These statistics are used not only to pick winning plays but also to help coaches select successful players and strengthen their teams. The data we are analyzing, the Lahman 2015 baseball database, includes statistics from 1871 through 2015, covering batting and pitching for both individual players and teams. Our goal with this analysis is to determine which variables make a team successful, where success is defined as a win-loss ratio greater than 1. Because strategies change and players are constantly retiring, we focused on the years most comparable to today: 2006 through 2015. Additionally, we look specifically at team statistics, rather than individual players, to determine success.
To begin to get an idea of what makes a team successful, we explored what are arguably the most successful teams in baseball, those that won the World Series, along with what we thought would be the indicators of a good team: number of runs and number of hits. We then compared the World Series winners to the teams that recorded the most runs and the most hits each season.
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore')
# Reads database
baseball_con = sqlite3.connect("lahman2015.sqlite")
# World Series winners with their total number of hits and runs
# (only one team wins the World Series each year, so no aggregation is needed)
WSWinners = pd.read_sql_query("SELECT yearID as 'Year', name as 'WS Winner', R as 'Total WS Runs', H as 'Total WS Hits' "
                              "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) AND WSWin = 'Y'", baseball_con)
# Teams that had the most runs each season
# (relies on SQLite returning the name from the row that produced MAX(R))
max_runs = pd.read_sql_query("SELECT yearID as 'Year', name as 'Max Runs Team', MAX(R) as '# of Runs' "
                             "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) GROUP BY yearID", baseball_con)
# Teams that had the most hits each season
max_hits = pd.read_sql_query("SELECT yearID as 'Year', name as 'Max Hits Team', MAX(H) as '# of Hits' "
"FROM Teams WHERE (yearID > 2005 AND yearID < 2016) GROUP BY yearID", baseball_con)
# Merge together so it's easier to analyze
max_runs_hit = max_runs.merge(max_hits, how = 'outer', on = 'Year')
WS_max = WSWinners.merge(max_runs_hit, how = 'left', on = 'Year')
WS_max
When comparing the World Series winners to the teams with the most hits and runs, it is clear that leading the league in runs or hits in a given season did not guarantee a World Series title. In only two years was the World Series winner also the team with the most runs: the New York Yankees in 2009 and the Boston Red Sox in 2013. The World Series winner was never the team with the most hits.
Interestingly though, many of the teams that had the most runs in a season were also the teams that had the most hits. This makes sense because a team must first get hits to put runners on base and then drive them in for runs.
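The comparison above was done by eye; as a sketch, the same check can be automated with pandas. The rows below are hypothetical stand-ins for the merged table, not results pulled from the database:

```python
import pandas as pd

# Toy stand-in for the merged WS_max table (rows are hypothetical examples)
ws_max = pd.DataFrame({
    "Year": [2009, 2010],
    "WS Winner": ["New York Yankees", "San Francisco Giants"],
    "Max Runs Team": ["New York Yankees", "Boston Red Sox"],
    "Max Hits Team": ["Los Angeles Angels", "Boston Red Sox"],
})

# Flag seasons where the champion also led the league in runs or hits
ws_max["Won & Max Runs"] = ws_max["WS Winner"] == ws_max["Max Runs Team"]
ws_max["Won & Max Hits"] = ws_max["WS Winner"] == ws_max["Max Hits Team"]
print(ws_max[["Year", "Won & Max Runs", "Won & Max Hits"]])
```

The same boolean columns could be added to the real WS_max frame to count matching seasons directly.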
Baseball strategy can be broken down into three categories: offense, defense, and pitching. From what we saw above, offense (including the number of hits and runs) alone does not make a team successful. To get a better picture of what makes a successful team, we explored additional defensive and pitching variables:
Defense: fielding percentage (FP) and double plays (DP)
Pitching: shutouts (SHO) and home runs allowed (HR)
To better look at how the variables we chose affect a team's success, we created a table that averages each variable over our time period (2006-2015) for each team. Additionally, we included the average number of wins and losses and calculated the average win-loss ratio for each team.
# Create a table to get an average on all the variables
avg_table = pd.read_sql("SELECT name as Team, franchID, AVG(H) as 'Average Hits', AVG(R) as 'Average Runs', AVG(FP) as 'Average Fielding %', "
"AVG(DP) AS 'Average Double Plays', AVG(SHO) AS 'Average Shutouts', AVG(HR) AS 'Average HR Allowed', "
"AVG(W) AS 'Average Wins', AVG(L) AS 'Average Losses' FROM Teams WHERE (yearID > 2005 AND yearID < 2016) "
"GROUP BY franchID ORDER BY Team", baseball_con)
# Get the win-loss ratio
avg_table['Win-Loss Ratio'] = avg_table['Average Wins']/avg_table['Average Losses']
avg_table
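As a quick sanity check on this definition of success, note that a win-loss ratio above 1 is equivalent to a winning percentage above .500 (W/L > 1 exactly when W > L). A minimal sketch with a hypothetical 90-72 record:

```python
# W/L > 1  <=>  W > L  <=>  W/(W+L) > 0.5
def win_loss_ratio(wins, losses):
    # the quantity used as our success criterion
    return wins / losses

def winning_pct(wins, losses):
    # the more familiar winning percentage
    return wins / (wins + losses)

# Hypothetical 90-72 season (illustration only)
print(win_loss_ratio(90, 72))  # 1.25
print(winning_pct(90, 72))
```

Either quantity gives the same success/failure classification; the ratio is simply what our table already contains.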
We expected our variables to correlate with the win-loss ratio as follows: hits, runs, fielding percentage, shutouts, and double plays positively, and home runs allowed negatively.
# heat map to look at the correlations between the variables
corr = avg_table.corr(numeric_only=True)
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
The above heatmap allows us to see the correlations between our variables. Specifically, we are looking at the rightmost column in the first six rows. We see that hits, runs, fielding percentage, shutouts, and home runs allowed are all positively correlated with the win-loss ratio. The only variable that is negatively correlated with the win-loss ratio is double plays. Most of these match our expectations, but the results for average double plays and home runs allowed were surprising. To get a better look at the correlations, we plotted pairwise scatterplots of the same variables, leaving out average wins and losses to make the plots easier to read.
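The rightmost heatmap column can also be read off numerically with pandas. A sketch on synthetic data (randomly generated for illustration, not the Lahman averages):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
wins = rng.normal(81, 10, 30)  # 30 synthetic team-average win totals

# Synthetic stand-ins, loosely tied to wins (illustration only)
df = pd.DataFrame({
    "Average Runs": wins * 5 + rng.normal(0, 20, 30),
    "Average Double Plays": -wins + rng.normal(150, 10, 30),
    "Win-Loss Ratio": wins / (162 - wins),
})

# The same quantity the heatmap's rightmost column displays
corr_with_ratio = df.corr()["Win-Loss Ratio"].drop("Win-Loss Ratio")
print(corr_with_ratio.sort_values(ascending=False))
```

On the real avg_table, `avg_table.corr(numeric_only=True)["Win-Loss Ratio"]` gives the exact values behind the heatmap colors.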
#pairwise scatterplots to get a detailed look at our correlations
sns.pairplot(avg_table, vars=["Average Hits", "Average Runs", "Average Fielding %", "Average Double Plays", "Average Shutouts",
"Average HR Allowed", "Win-Loss Ratio"])
plt.show()
# Average Fielding % and Win-Loss Ratio correlation
# (sns.plt was removed in seaborn 0.8; use plt.title, and pass regplot keyword args)
sns.regplot(x=avg_table['Win-Loss Ratio'], y=avg_table['Average Fielding %'], ci=None)
plt.title('Average Fielding Percentage vs Win-Loss Ratio')
plt.show()
As mentioned above, we plotted a pairwise scatterplot of our six variables and the win-loss ratio; again, the column we are specifically interested in is the rightmost one. The scale of the average fielding percentage is so small that its correlation did not show up in the pairplot, so we plotted it separately. We get the same results as the heatmap above, but these scatterplots let us see the relationships more clearly.
The very slight negative correlation between double plays and the win-loss ratio may be because double plays can only be turned when opposing runners are on base, so teams that allow more baserunners get more double-play opportunities. Additionally, the surprising positive correlation between home runs allowed and the win-loss ratio may reflect successful teams that pair a strong offense with pitching that gives up the occasional home run.
The next thing we wanted to do was model our data using linear regression. But before that, we wanted to figure out which of our 8 variables (including the number of wins and the number of losses) are actually needed to determine whether a team is successful. As a reminder, our definition of success is a win-loss ratio greater than 1.
# adding a binomial 'success' column to our dataframe where 1 is a success and 0 is not
# the spaces in our table's column names were causing issues, so we created a new table with underscores
avg_table1 = pd.read_sql("SELECT AVG(H) as 'Average_Hits', AVG(R) as 'Average_Runs', AVG(FP) as 'Average_Fielding_Perc', "
"AVG(DP) AS 'Average_Double_Plays', AVG(SHO) AS 'Average_Shutouts', AVG(HR) AS 'Average_HR_Allowed', "
"AVG(W) AS 'Average_Wins', AVG(L) AS 'Average_Losses' FROM Teams WHERE (yearID > 2005 AND yearID < 2016) "
"GROUP BY franchID ORDER BY name", baseball_con)
avg_table1['Win_Loss_Ratio'] = avg_table1['Average_Wins']/avg_table1['Average_Losses']
avg_table1.head()
# adding the success column to our dataframe where any ratio greater than 1 is a success and 0 otherwise
def f(row):
    if row['Win_Loss_Ratio'] > 1:
        val = 1
    else:
        val = 0
    return val

avg_table1['Success'] = avg_table1.apply(f, axis=1)
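The row-wise apply works, but pandas also supports a vectorized one-liner for the same 0/1 encoding. A sketch on a toy ratio column (hypothetical values):

```python
import pandas as pd

# Toy stand-in for avg_table1 (hypothetical ratios)
toy = pd.DataFrame({"Win_Loss_Ratio": [1.25, 0.80, 1.00, 1.10]})

# Vectorized equivalent of the row-wise function:
# ratios strictly greater than 1 map to 1, everything else to 0
toy["Success"] = (toy["Win_Loss_Ratio"] > 1).astype(int)
print(toy)
```

Note that a ratio of exactly 1.00 (an 81-81 team) is coded as 0, matching the strict inequality in our definition of success.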
# drop the win-loss ratio column (it's redundant with Success) and add an intercept term for modeling
avg_table1 = avg_table1.drop('Win_Loss_Ratio', axis=1)
avg_table1['intercept'] = 1
avg_table1.head()
# forward-selection helper from a GitHub repo; link to the site in Resources below
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response
    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response,
                                       ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {}".format(response,
                               ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
We used forward selection in this case because we did not want to overfit our model by using backward selection.
# Run forward selection on our data
model = forward_selected(avg_table1, 'Success')
print(model.model.formula)
From the forward selection that we ran, we found that success is determined by average losses, average shutouts, average runs, and average fielding percentage. This is interesting because it includes one variable from each of our categories, offense (runs), defense (fielding percentage), and pitching (shutouts), as well as a record variable (losses).
One of the issues that we ran into was perfect separation. Here is an image that explains what perfect separation is.
Image(filename = "cblrK.png")
Some possible reasons why we may have run into this issue:
1. Keeping too many x variables: with many predictors, the combined x values for rows with y = 0 may cluster in one region while those with y = 1 cluster in another, producing perfect separation.
2. A single x variable separates the outcomes, as shown in the plot above.
3. (1) and (2) together: the data can be split into two groups based on a subset of the x variables.
We decided to use a fitting method that ignores perfect separation: with .fit(method = "bfgs"), the perfect separation error is not raised. We believe perfect separation is not a big problem here because we can still make predictions.
We chose to use a logistic regression model because our dependent variable "Success" is a binomial categorical variable, where there are only two possible outcomes, Y = 0 (failure) and Y = 1 (success).
# create list of predictor variables based on forward selection results
predictor_vars = avg_table1[['Average_Shutouts', 'Average_Fielding_Perc', 'Average_Losses', 'Average_Runs', 'intercept']]
# regress variables on Success using logistic regression
logit = sm.Logit(avg_table1['Success'], predictor_vars)
# fit the model
result = logit.fit(method = "bfgs")
print(result.summary())
The model's log-likelihood ratio test compares our model against an intercept-only model: $H_{0}$: all predictor coefficients are zero vs $H_{A}$: at least one coefficient is nonzero.
The log-likelihood ratio test p-value is 5.235e-09, so we reject the null hypothesis and conclude that our model, which includes average losses, average shutouts, average runs, and average fielding percentage, fits significantly better than an intercept-only model.
Our model is: $\text{logit}(p) = 0.1901 + 23.5495\,X^{\text{shutouts}}_{1} + 0.1944\,X^{\text{fielding}}_{2} - 6.3831\,X^{\text{losses}}_{3} + 0.3780\,X^{\text{runs}}_{4}$, where $p$ is the probability that a team is successful.
From our analysis, we can conclude that a team's success (a win-loss ratio greater than one) is determined by the team's average losses, average shutouts, average runs, and average fielding percentage. From our logit model, $\text{logit}(p) = 0.1901 + 23.5495\,X^{\text{shutouts}}_{1} + 0.1944\,X^{\text{fielding}}_{2} - 6.3831\,X^{\text{losses}}_{3} + 0.3780\,X^{\text{runs}}_{4}$, we can see that the average number of shutouts plays a large role in determining whether a team was successful, because it has by far the largest coefficient (though the predictors are on different scales, so coefficient sizes are only a rough guide). This makes sense because shutouts, while primarily a pitching statistic, also reflect defense: a team needs both a strong bullpen and a strong defense to record many shutouts.
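The fitted log-odds can be converted into a win-probability with the inverse-logit (sigmoid) function. A sketch using the coefficients from the summary above; the team averages plugged in are hypothetical, chosen only for illustration:

```python
import math

def inv_logit(z):
    # logistic (inverse-logit) function: maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the fitted model above
b0, b_sho, b_fp, b_loss, b_runs = 0.1901, 23.5495, 0.1944, -6.3831, 0.3780

# Hypothetical team averages (illustration only, not from the database)
shutouts, fielding, losses, runs = 12.0, 0.985, 75.0, 750.0

z = b0 + b_sho * shutouts + b_fp * fielding + b_loss * losses + b_runs * runs
print(inv_logit(z))  # predicted probability of success for this toy team
```

In practice `result.predict()` performs the same transformation on the fitted data; the point here is just that the linear predictor is a log-odds, not a probability.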
This brings us back to our initial analysis of the World Series winners. They did not necessarily have the best offense, but now we know that offense alone is not the key to being successful at baseball.
We recognize that this model may not be perfect because a team's win-loss ratio can fluctuate from season to season. For example, the Kansas City Royals won the World Series in 2015; however, their average win-loss ratio over our period would not be defined as "successful". This may be because players get traded between teams and teams change their strategies. Overall, we believe that our model does a good job of predicting a successful team.
Janice Luong: Github master and modeling
Sierra Tevlin: Initial analysis and written explanations
Worked together on SQL and plotting (coded together)