The singular goal in baseball is to score more runs than the opposing team, but that goal can be reached in many different ways. Like many other sports, baseball relies heavily on statistics to determine which strategies are effective and which are not. These statistics are used not only to pick winning plays but also to help coaches select successful players and strengthen their teams. The data we are analyzing, the Lahman 2015 baseball database, includes statistics from 1871 through 2015, covering batting and pitching for both individual players and teams. Our goal with this analysis is to determine which variables make a team successful, where success is defined as a win-loss ratio greater than 1. Because strategies change and players are constantly retiring, we focused on the years most comparable to today: 2006 through 2015. Additionally, we look specifically at team statistics, rather than individual players, to determine success.
To begin to get an idea of what makes a team successful, we explored what are arguably the most successful teams in baseball, those that won the World Series, along with what we thought would be the indicators of a good team: number of runs and number of hits. We then compared the World Series winners to the teams that recorded the most runs and the most hits each season.
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore')
# Reads database
baseball_con = sqlite3.connect("lahman2015.sqlite")
# World Series winners with their total number of hits and runs
# (only one team wins the World Series each year, so no aggregation is needed)
WSWinners = pd.read_sql_query("SELECT yearID as 'Year', name as 'WS Winner', R as 'Total WS Runs', H as 'Total WS Hits' "
                              "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) AND WSWin = 'Y'", baseball_con)
# Teams that had the most runs each season
# (relies on SQLite returning the name from the row that produced MAX(R))
max_runs = pd.read_sql_query("SELECT yearID as 'Year', name as 'Max Runs Team', MAX(R) as '# of Runs' "
                             "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) GROUP BY yearID", baseball_con)
# Teams that had the most hits each season
max_hits = pd.read_sql_query("SELECT yearID as 'Year', name as 'Max Hits Team', MAX(H) as '# of Hits' "
"FROM Teams WHERE (yearID > 2005 AND yearID < 2016) GROUP BY yearID", baseball_con)
# Merge together so it's easier to analyze
max_runs_hit = max_runs.merge(max_hits, how = 'outer', on = 'Year')
WS_max = WSWinners.merge(max_runs_hit, how = 'left', on = 'Year')
WS_max
When comparing the World Series winners to the teams with the most hits and runs, it is clear that leading the league in runs or hits in a given season did not guarantee a World Series title. In only two years was the World Series winner also the team with the most runs: the New York Yankees in 2009 and the Boston Red Sox in 2013. The World Series winner was never the team with the most hits.
Interestingly though, many of the teams that had the most runs in a season were also the teams that had the most hits. This makes sense because a team must first get hits to put runners on base and then drive them in for runs.
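The comparison above was done by eye; as a sketch, the same check can be automated with pandas. The rows below are hypothetical stand-ins for the merged table, not results pulled from the database:

```python
import pandas as pd

# Toy stand-in for the merged WS_max table (rows are hypothetical examples)
ws_max = pd.DataFrame({
    "Year": [2009, 2010],
    "WS Winner": ["New York Yankees", "San Francisco Giants"],
    "Max Runs Team": ["New York Yankees", "Boston Red Sox"],
    "Max Hits Team": ["Los Angeles Angels", "Boston Red Sox"],
})

# Flag seasons where the champion also led the league in runs or hits
ws_max["Won & Max Runs"] = ws_max["WS Winner"] == ws_max["Max Runs Team"]
ws_max["Won & Max Hits"] = ws_max["WS Winner"] == ws_max["Max Hits Team"]
print(ws_max[["Year", "Won & Max Runs", "Won & Max Hits"]])
```

The same boolean columns could be added to the real WS_max frame to count matching seasons directly.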
Baseball strategy can be broken down into three categories: offense, defense, and pitching. From what we saw above, offense (including the number of hits and runs) alone does not make a team successful. To get a better picture of what makes a successful team, we explored additional defensive and pitching variables:
Defense: fielding percentage (FP) and double plays (DP)
Pitching: shutouts (SHO) and home runs allowed (HR)
To better look at how the variables we chose affect a team's success, we created a table that averages each variable over our time period (2006-2015) for each team. Additionally, we included the average number of wins and losses and calculated the average win-loss ratio for each team.
# Create a table to get an average on all the variables
avg_table = pd.read_sql("SELECT name as Team, franchID, AVG(H) as 'Average Hits', AVG(R) as 'Average Runs', AVG(FP) as 'Average Fielding %', "
"AVG(DP) AS 'Average Double Plays', AVG(SHO) AS 'Average Shutouts', AVG(HR) AS 'Average HR Allowed', "
"AVG(W) AS 'Average Wins', AVG(L) AS 'Average Losses' FROM Teams WHERE (yearID > 2005 AND yearID < 2016) "
"GROUP BY franchID ORDER BY Team", baseball_con)
# Get the win-loss ratio
avg_table['Win-Loss Ratio'] = avg_table['Average Wins']/avg_table['Average Losses']
avg_table
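As a quick sanity check on this definition of success, note that a win-loss ratio above 1 is equivalent to a winning percentage above .500 (W/L > 1 exactly when W > L). A minimal sketch with a hypothetical 90-72 record:

```python
# W/L > 1  <=>  W > L  <=>  W/(W+L) > 0.5
def win_loss_ratio(wins, losses):
    # the quantity used as our success criterion
    return wins / losses

def winning_pct(wins, losses):
    # the more familiar winning percentage
    return wins / (wins + losses)

# Hypothetical 90-72 season (illustration only)
print(win_loss_ratio(90, 72))  # 1.25
print(winning_pct(90, 72))
```

Either quantity gives the same success/failure classification; the ratio is simply what our table already contains.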
We expected our variables to correlate with the win-loss ratio as follows: hits, runs, fielding percentage, shutouts, and double plays positively, and home runs allowed negatively.
# heat map to look at the correlations between the variables
corr = avg_table.corr(numeric_only=True)
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
The above heatmap allows us to see the correlations between our variables. Specifically, we are looking at the rightmost column in the first six rows. We see that hits, runs, fielding percentage, shutouts, and home runs allowed are all positively correlated with the win-loss ratio. The only variable that is negatively correlated with the win-loss ratio is double plays. Most of these match our expectations, but the results for average double plays and home runs allowed were surprising. To get a better look at the correlations, we plotted pairwise scatterplots of the same variables, leaving out average wins and losses to make the plots easier to read.
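The rightmost heatmap column can also be read off numerically with pandas. A sketch on synthetic data (randomly generated for illustration, not the Lahman averages):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
wins = rng.normal(81, 10, 30)  # 30 synthetic team-average win totals

# Synthetic stand-ins, loosely tied to wins (illustration only)
df = pd.DataFrame({
    "Average Runs": wins * 5 + rng.normal(0, 20, 30),
    "Average Double Plays": -wins + rng.normal(150, 10, 30),
    "Win-Loss Ratio": wins / (162 - wins),
})

# The same quantity the heatmap's rightmost column displays
corr_with_ratio = df.corr()["Win-Loss Ratio"].drop("Win-Loss Ratio")
print(corr_with_ratio.sort_values(ascending=False))
```

On the real avg_table, `avg_table.corr(numeric_only=True)["Win-Loss Ratio"]` gives the exact values behind the heatmap colors.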
#pairwise scatterplots to get a detailed look at our correlations
sns.pairplot(avg_table, vars=["Average Hits", "Average Runs", "Average Fielding %", "Average Double Plays", "Average Shutouts",
"Average HR Allowed", "Win-Loss Ratio"])
plt.show()
# Average Fielding % and Win-Loss Ratio correlation
# (sns.plt was removed in seaborn 0.8; use plt.title, and pass regplot keyword args)
sns.regplot(x=avg_table['Win-Loss Ratio'], y=avg_table['Average Fielding %'], ci=None)
plt.title('Average Fielding Percentage vs Win-Loss Ratio')
plt.show()
As mentioned above, we plotted a pairwise scatterplot of our six variables and the win-loss ratio; again, the column we are specifically interested in is the rightmost one. The scale of the average fielding percentage is so small that its correlation did not show up in the pairplot, so we plotted it separately. We get the same results as the heatmap above, but these scatterplots let us see the relationships more clearly.
The very slight negative correlation between double plays and the win-loss ratio may be because double plays can only be turned when opposing runners are on base, so teams that allow more baserunners get more double-play opportunities. Additionally, the surprising positive correlation between home runs allowed and the win-loss ratio may reflect successful teams that pair a strong offense with pitching that gives up the occasional home run.
The next thing we wanted to do was model our data using linear regression. But before that, we wanted to figure out which of our 8 variables (including the number of wins and the number of losses) are actually needed to determine whether a team is successful. As a reminder, our definition of success is a win-loss ratio greater than 1.
# adding a binomial 'success' column to our dataframe where 1 is a success and 0 is not
# the spaces in our table's column names were causing issues, so we created a new table with underscores
avg_table1 = pd.read_sql("SELECT AVG(H) as 'Average_Hits', AVG(R) as 'Average_Runs', AVG(FP) as 'Average_Fielding_Perc', "
"AVG(DP) AS 'Average_Double_Plays', AVG(SHO) AS 'Average_Shutouts', AVG(HR) AS 'Average_HR_Allowed', "
"AVG(W) AS 'Average_Wins', AVG(L) AS 'Average_Losses' FROM Teams WHERE (yearID > 2005 AND yearID < 2016) "
"GROUP BY franchID ORDER BY name", baseball_con)
avg_table1['Win_Loss_Ratio'] = avg_table1['Average_Wins']/avg_table1['Average_Losses']
avg_table1.head()
# adding the success column to our dataframe where any ratio greater than 1 is a success and 0 otherwise
def f(row):
    if row['Win_Loss_Ratio'] > 1:
        val = 1
    else:
        val = 0
    return val

avg_table1['Success'] = avg_table1.apply(f, axis=1)
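The row-wise apply works, but pandas also supports a vectorized one-liner for the same 0/1 encoding. A sketch on a toy ratio column (hypothetical values):

```python
import pandas as pd

# Toy stand-in for avg_table1 (hypothetical ratios)
toy = pd.DataFrame({"Win_Loss_Ratio": [1.25, 0.80, 1.00, 1.10]})

# Vectorized equivalent of the row-wise function:
# ratios strictly greater than 1 map to 1, everything else to 0
toy["Success"] = (toy["Win_Loss_Ratio"] > 1).astype(int)
print(toy)
```

Note that a ratio of exactly 1.00 (an 81-81 team) is coded as 0, matching the strict inequality in our definition of success.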
# drop the win-loss ratio column (it's redundant with Success) and add an intercept term for modeling
avg_table1 = avg_table1.drop('Win_Loss_Ratio', axis=1)
avg_table1['intercept'] = 1
avg_table1.head()
# forward-selection helper from a GitHub repo; link to the site in Resources below
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response
    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response,
                                       ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {}".format(response,
                               ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
We used forward selection in this case because we did not want to overfit our model by using backward selection.
# Run forward selection on our data
model = forward_selected(avg_table1, 'Success')
print(model.model.formula)
From the forward selection that we ran, we found that success is determined by average losses, average shutouts, average runs, and average fielding percentage. This is interesting because it includes one variable from each of our categories, offense (runs), defense (fielding percentage), and pitching (shutouts), as well as a record variable (losses).
One of the issues that we ran into was perfect separation. Here is an image that explains what perfect separation is.
Image(filename = "cblrK.png")
Some possible reasons why we may have run into this issue:
1. Keeping too many x variables: with many predictors, the combined x values for rows with y = 0 may cluster in one region while those with y = 1 cluster in another, producing perfect separation.
2. A single x variable separates the outcomes, as shown in the plot above.
3. (1) and (2) together: the data can be split into two groups based on a subset of the x variables.
We decided to use a fitting method that ignores perfect separation: with .fit(method = "bfgs"), the perfect separation error is not raised. We believe perfect separation is not a big problem here because we can still make predictions.
We chose to use a logistic regression model because our dependent variable "Success" is a binomial categorical variable, where there are only two possible outcomes, Y = 0 (failure) and Y = 1 (success).
# create list of predictor variables based on forward selection results
predictor_vars = avg_table1[['Average_Shutouts', 'Average_Fielding_Perc', 'Average_Losses', 'Average_Runs', 'intercept']]
# regress variables on Success using logistic regression
logit = sm.Logit(avg_table1['Success'], predictor_vars)
# fit the model
result = logit.fit(method = "bfgs")
print(result.summary())
The model's log-likelihood ratio test compares our model against an intercept-only model: $H_{0}$: all predictor coefficients are zero vs $H_{A}$: at least one coefficient is nonzero.
The log-likelihood ratio test p-value is 5.235e-09, so we reject the null hypothesis and conclude that our model, which includes average losses, average shutouts, average runs, and average fielding percentage, fits significantly better than an intercept-only model.
Our model is: $\text{logit}(p) = 0.1901 + 23.5495\,X^{\text{shutouts}}_{1} + 0.1944\,X^{\text{fielding}}_{2} - 6.3831\,X^{\text{losses}}_{3} + 0.3780\,X^{\text{runs}}_{4}$, where $p$ is the probability that a team is successful.
From our analysis, we can conclude that a team's success (a win-loss ratio greater than one) is determined by the team's average losses, average shutouts, average runs, and average fielding percentage. From our logit model, $\text{logit}(p) = 0.1901 + 23.5495\,X^{\text{shutouts}}_{1} + 0.1944\,X^{\text{fielding}}_{2} - 6.3831\,X^{\text{losses}}_{3} + 0.3780\,X^{\text{runs}}_{4}$, we can see that the average number of shutouts plays a large role in determining whether a team was successful, because it has by far the largest coefficient (though the predictors are on different scales, so coefficient sizes are only a rough guide). This makes sense because shutouts, while primarily a pitching statistic, also reflect defense: a team needs both a strong bullpen and a strong defense to record many shutouts.
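The fitted log-odds can be converted into a win-probability with the inverse-logit (sigmoid) function. A sketch using the coefficients from the summary above; the team averages plugged in are hypothetical, chosen only for illustration:

```python
import math

def inv_logit(z):
    # logistic (inverse-logit) function: maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the fitted model above
b0, b_sho, b_fp, b_loss, b_runs = 0.1901, 23.5495, 0.1944, -6.3831, 0.3780

# Hypothetical team averages (illustration only, not from the database)
shutouts, fielding, losses, runs = 12.0, 0.985, 75.0, 750.0

z = b0 + b_sho * shutouts + b_fp * fielding + b_loss * losses + b_runs * runs
print(inv_logit(z))  # predicted probability of success for this toy team
```

In practice `result.predict()` performs the same transformation on the fitted data; the point here is just that the linear predictor is a log-odds, not a probability.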
This brings us back to our initial analysis of the World Series winners. They did not necessarily have the best offense, but now we know that offense alone is not the key to being successful at baseball.
We recognize that this model may not be perfect because a team's win-loss ratio can fluctuate from season to season. For example, the Kansas City Royals won the World Series in 2015; however, their average win-loss ratio over our period would not be defined as "successful". This may be because players get traded between teams and teams change their strategies. Overall, we believe that our model does a good job of predicting a successful team.
Janice Luong: Github master and modeling
Sierra Tevlin: Initial analysis and written explanations
Worked together on SQL and plotting (coded together)