What Makes A Baseball Team Successful?

Statistics 141B Final Project

By Janice Luong and Sierra Tevlin

Introduction

The singular goal in baseball is to get more runs than the opposing team. But that goal can be reached in a multitude of different ways. Like many other sports, baseball relies heavily on statistics to find what strategies are effective and which are not. These strategies are not only used to pick winning plays but also allow coaches to pick successful players to make their team stronger. The data that we are anaylizing, the Lahman 2015 baseball database, includes statistics that range from 2015 back until 1871. It includes information about batting and pitching for both individual players and for the teams. Our goal with this anaylsis is to determine what variables make a team successful, where success is defined as a win-loss ratio greater than 1. Strategies in baseball change and players are contasntly retiring, so our goal was to look at the years that are comparable to today, so we focused on data from the year 2006 to the year 2015. Additionally, we are going to be specifically looking at team statistics, rather than specific players, to determine success.

Initial Anaylsis

To begin to get an idea of what makes a team successful, we explored what would arguably be the most successful teams in baseball: those that won the World Series. And what we thought would be the indicators of a good team: number of runs and number of hits. And then we compared them to the teams that got the most runs and the most hits.

In [1]:
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Reads database
baseball_con = sqlite3.connect("lahman2015.sqlite")
In [3]:
# World Series winners with their total number of hits and runs
WSWinners = pd.read_sql_query("SELECT yearID as 'Year', name as 'WS Winner', R as 'Total WS Runs', SUM(H) as 'Total WS Hits' "
                              "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) AND WSWin = 'Y' GROUP BY yearID"
                              , baseball_con)

# Teams that had the most runs each season
max_runs = pd.read_sql_query("SELECT yearID as 'Year', name as 'Max Runs Team', MAX(R) as '# of Runs' "
                             "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) GROUP BY yearID LIMIT 10", baseball_con)

# Teams that had the most hits each season
max_hits = pd.read_sql_query("SELECT yearID as 'Year', name as 'Max Hits Team', MAX(H) as '# of Hits' "
                             "FROM Teams WHERE (yearID > 2005 AND yearID < 2016) GROUP BY yearID", baseball_con)

# Merge together so its easier to analyze
max_runs_hit = max_runs.merge(max_hits, how = 'outer', on = 'Year')

WS_max = WSWinners.merge(max_runs_hit, how = 'left', on = 'Year')
WS_max
Out[3]:
Year WS Winner Total WS Runs Total WS Hits Max Runs Team # of Runs Max Hits Team # of Hits
0 2006 St. Louis Cardinals 781 1484 New York Yankees 930 Minnesota Twins 1608
1 2007 Boston Red Sox 867 1561 New York Yankees 968 New York Yankees 1656
2 2008 Philadelphia Phillies 799 1407 Texas Rangers 901 Texas Rangers 1619
3 2009 New York Yankees 915 1604 New York Yankees 915 Los Angeles Angels of Anaheim 1604
4 2010 San Francisco Giants 697 1411 New York Yankees 859 Texas Rangers 1556
5 2011 St. Louis Cardinals 762 1513 Boston Red Sox 875 Boston Red Sox 1600
6 2012 San Francisco Giants 718 1495 Texas Rangers 808 Texas Rangers 1526
7 2013 Boston Red Sox 853 1566 Boston Red Sox 853 Detroit Tigers 1625
8 2014 San Francisco Giants 665 1407 Los Angeles Angels of Anaheim 773 Detroit Tigers 1557
9 2015 Kansas City Royals 724 1497 Toronto Blue Jays 891 Detroit Tigers 1515

When comparing the World Series winners to the teams that had the most hits and runs, it is clear that the teams that had the most runs or the most hits in a given season were not necessarily the World Series winners. There are only two years that the World Series winner was also the team with the maximum number of runs: the New York Yankees in 2009 and the Boston Red Sox in 2013. And the World Series winner was never the team with the maximum number of hits.

Interestingly though, many of the teams that had the maximum number of runs in a season also were the teams that had the maximum number of hits. This makes sense because to be able to get runs, the players have to be willing to hit the ball in the first place.

Expanding the Defition of Success

Baseball strategy can be broken down into three categories: offense, defense, and pitching. From what we saw above, offense (including the number of hits and runs) alone does not make a team successful. To get a better picture of what makes a successful team, we explored additional defensive and pitching variables:

Defense:

  • Fielding Percentage - the percent of time a defensive player properly handles a ball, either thrown or batted
  • Number of Double Plays - the number of times that the defensive team gets two outs in one continuous play
  • Pitching:

  • Number of shutouts - the number of times a starting pitcher pitches an entire game and does not allow the opposing team to score once
  • Number of homeruns allowed - the number of homeruns allowed by the pitchers on a team
  • To better look at how the variables that we chose affect a team's success, we created a table that includes each variable averaged over our time period (2006-2015) for each team. Additionally, we included the average number of losses and wins, calculated the average win-loss ratio for each team.

    In [4]:
    # Create a table to get an average on all the variables
    avg_table = pd.read_sql("SELECT name as Team, franchID, AVG(H) as 'Average Hits', AVG(R) as 'Average Runs', AVG(FP) as 'Average Fielding %', "
                            "AVG(DP) AS 'Average Double Plays', AVG(SHO) AS 'Average Shutouts', AVG(HR) AS 'Average HR Allowed', "
                            "AVG(W) AS 'Average Wins', AVG(L) AS 'Average Losses' FROM Teams WHERE (yearID > 2005 AND yearID < 2016) "
                            "GROUP BY franchID ORDER BY Team", baseball_con)
    
    # Get the win-loss ratio
    avg_table['Win-Loss Ratio'] = avg_table['Average Wins']/avg_table['Average Losses']
    avg_table
    
    Out[4]:
    Team franchID Average Hits Average Runs Average Fielding % Average Double Plays Average Shutouts Average HR Allowed Average Wins Average Losses Win-Loss Ratio
    0 Arizona Diamondbacks ARI 1409.9 712.3 0.9837 145.6 8.9 158.2 78.2 83.8 0.933174
    1 Atlanta Braves ATL 1417.3 706.0 0.9845 150.9 10.5 154.2 83.7 78.3 1.068966
    2 Baltimore Orioles BAL 1459.2 724.3 0.9849 153.1 7.8 181.6 76.1 85.8 0.886946
    3 Boston Red Sox BOS 1511.7 806.6 0.9853 144.5 9.7 178.4 86.6 75.4 1.148541
    4 Chicago Cubs CHC 1407.3 688.7 0.9819 131.7 9.9 159.6 77.4 84.4 0.917062
    5 Chicago White Sox CHW 1422.4 713.0 0.9834 155.1 9.7 182.6 79.4 82.7 0.960097
    6 Cincinnati Reds CIN 1397.9 703.6 0.9848 139.4 9.9 176.2 80.1 81.9 0.978022
    7 Cleveland Indians CLE 1432.7 735.9 0.9834 156.0 9.6 157.8 79.5 82.4 0.964806
    8 Colorado Rockies COL 1491.3 768.5 0.9847 166.8 6.9 171.1 76.0 86.1 0.882695
    9 Detroit Tigers DET 1539.1 777.9 0.9837 154.6 9.5 172.9 86.4 75.6 1.142857
    10 Houston Astros HOU 1376.4 659.0 0.9839 145.2 9.0 154.0 70.9 91.0 0.779121
    11 Kansas City Royals KCR 1488.3 694.5 0.9830 152.1 8.3 121.7 75.1 86.9 0.864212
    12 Los Angeles Angels of Anaheim ANA 1475.3 751.8 0.9834 142.5 11.7 160.6 89.6 72.4 1.237569
    13 Los Angeles Dodgers LAD 1446.3 701.7 0.9840 142.7 13.8 137.6 87.5 74.4 1.176075
    14 Miami Marlins FLA 1401.2 681.4 0.9822 144.6 9.9 152.5 75.1 86.8 0.865207
    15 Milwaukee Brewers MIL 1416.0 725.8 0.9824 143.1 9.2 181.2 80.8 81.2 0.995074
    16 Minnesota Twins MIN 1461.2 729.1 0.9846 154.4 8.4 135.5 79.2 83.0 0.954217
    17 New York Mets NYM 1414.5 706.3 0.9840 135.1 12.2 145.1 81.7 80.3 1.017435
    18 New York Yankees NYY 1484.6 817.9 0.9854 140.9 8.1 200.6 92.6 69.4 1.334294
    19 Oakland Athletics OAK 1384.4 712.8 0.9827 149.3 12.0 150.2 82.0 79.9 1.026283
    20 Philadelphia Phillies PHI 1428.1 740.0 0.9850 141.9 10.6 173.9 84.8 77.2 1.098446
    21 Pittsburgh Pirates PIT 1392.8 664.7 0.9831 158.7 9.2 142.7 75.2 86.7 0.867359
    22 San Diego Padres SDP 1341.1 645.9 0.9847 135.9 10.9 137.4 77.9 84.2 0.925178
    23 San Francisco Giants SFG 1426.0 670.1 0.9844 134.6 13.5 127.1 82.7 79.2 1.044192
    24 Seattle Mariners SEA 1394.4 646.3 0.9845 152.2 9.8 149.0 74.9 87.1 0.859931
    25 St. Louis Cardinals STL 1476.4 732.7 0.9841 161.7 12.3 149.7 88.9 73.0 1.217808
    26 Tampa Bay Rays TBD 1389.7 721.0 0.9843 138.7 11.7 171.2 83.4 78.7 1.059720
    27 Texas Rangers TEX 1505.1 790.4 0.9822 161.5 10.6 181.1 84.6 77.5 1.091613
    28 Toronto Blue Jays TOR 1440.1 761.4 0.9845 152.9 10.7 193.4 82.0 80.0 1.025000
    29 Washington Nationals WSN 1391.7 682.5 0.9815 141.1 8.9 154.7 77.4 84.4 0.917062

    We expected our variables to correlate with the win-loss ratio as follows:

  • Average hits will be positively correlated with the win-loss ratio because if a team is more willing to hit the ball, they will have more runs and will win more often
  • Average runs will be positively correlated with the win-loss ratio because if a team gets more runs, they have a higher chance of winning more games
  • Average fielding percntage will be positively correlated with the win-loss ratio because if a team can properly handle the ball, they will have a better change of winning games
  • Average number of double plays will be positively correlated with the win-loss ratio because a team that knows how to effectively get the opposing team out, will win more often
  • Average number of shutouts will be positively correlated with the the win-loss ratio because if a team has good pitching, they will likely be able to win more games
  • Average number of homeruns allowed will be negatively correlated with the win-loss ratio because if a team lets the opposing team score more often, they will be less likely to win games
  • In [5]:
    # heat map to look at the correlations between the variables
    corr = avg_table.corr()
    sns.heatmap(corr,
                xticklabels=corr.columns.values,
                yticklabels=corr.columns.values)
    plt.show()
    

    The above heatmap allows us to see the correlation between our variables. Specifically we are looking at the right most column in the first 6 rows. We see that hits, runs, fielding percentage, shutouts, and homeruns allowed are all positively correlated to the win-loss ratio. The only variable that is negatively correlated with the win-loss ratio is double plays. Most of these are what we expected but the average double plays and the number of homeruns allowed were surprising. To have a better look at the correlations, we plotted pairwise scatterplots of the same variables, but left out the average wins and losses to make it easier to understand.

    In [6]:
    #pairwise scatterplots to get a detailed look at our correlations
    sns.pairplot(avg_table, vars=["Average Hits", "Average Runs", "Average Fielding %", "Average Double Plays", "Average Shutouts",
                                 "Average HR Allowed", "Win-Loss Ratio"])
    plt.show()
    
    In [7]:
    #Average Fielding % and Win-Loss Ratio correlation 
    sns.regplot(avg_table['Win-Loss Ratio'], avg_table['Average Fielding %'], ci = None)
    sns.plt.title('Average Fielding Percentage vs Win-Loss Ratio')
    plt.show()
    

    Like we mentioned above, we plotted a pairwise scatterplot of our six variables and the win-loss ratio. Again, the column that we are specifically interested in is the right most column. The scale of the average fielding percentage was so small that the correlation did not show up in the scatterplot, so we individually plotted the scatterplot. We get the same results as the heat map above, but this scatterplot allows us to see the plots more clearly.

    The very slight negative correlation between double plays and the win-loss ratio may because teams try to avoid getting errors (or misplaying a ball in a manner that allows a batter to advance one or more bases). Additionally, the surprising positive correlation between the number of homeruns allowed and the win-loss ratio may be due to a lack of good pitching but a good offensive team for successful teams.

    Modeling

    The next thing that we wanted to do was model our data using linear regression. But before we do that, we wanted to figure out which of our 8 variables (including number of wins and number of losses) are actually needed in determining if a team is successful or not. As a reminder, our definition of success is a team that has a win-loss ratio that is greater than 1.

    In [8]:
    # adding a binomial 'success' column to our dataframe where 1 is a success and 0 is not
    # the spaces in our are table were giving us issues so we created a new table with underscores for this purpose
    avg_table1 = pd.read_sql("SELECT AVG(H) as 'Average_Hits', AVG(R) as 'Average_Runs', AVG(FP) as 'Average_Fielding_Perc', "
                             "AVG(DP) AS 'Average_Double_Plays', AVG(SHO) AS 'Average_Shutouts', AVG(HR) AS 'Average_HR_Allowed', "
                             "AVG(W) AS 'Average_Wins', AVG(L) AS 'Average_Losses' FROM Teams WHERE (yearID > 2005 AND yearID < 2016) "
                             "GROUP BY franchID ORDER BY name", baseball_con)
    
    avg_table1['Win_Loss_Ratio'] = avg_table1['Average_Wins']/avg_table1['Average_Losses']
    avg_table1.head()
    
    # adding the success column to our dataframe where any ratio greater than 1 is a success and 0 otherwise
    def f(row):
        if row['Win_Loss_Ratio'] > 1:
            val = 1
        else:
            val = 0
        return val
    
    avg_table1['Success'] = avg_table1.apply(f, axis=1)
    
    # delete win-loss ratio, its repetative. Add intercept term for modeling
    avg_table1 = avg_table1.drop('Win_Loss_Ratio', 1)
    avg_table1['intercept'] = 1
    
    avg_table1.head()
    
    Out[8]:
    Average_Hits Average_Runs Average_Fielding_Perc Average_Double_Plays Average_Shutouts Average_HR_Allowed Average_Wins Average_Losses Success intercept
    0 1409.9 712.3 0.9837 145.6 8.9 158.2 78.2 83.8 0 1
    1 1417.3 706.0 0.9845 150.9 10.5 154.2 83.7 78.3 1 1
    2 1459.2 724.3 0.9849 153.1 7.8 181.6 76.1 85.8 0 1
    3 1511.7 806.6 0.9853 144.5 9.7 178.4 86.6 75.4 1 1
    4 1407.3 688.7 0.9819 131.7 9.9 159.6 77.4 84.4 0 1
    In [9]:
    # found this def from a github repo, link to the site in Resources below
    
    def forward_selected(data, response):
        """Linear model designed by forward selection.
    
        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response
    
        response: string, name of response column in data
    
        Returns:
        --------
        model: an "optimal" fitted statsmodels linear model
               with an intercept
               selected by forward selection
               evaluated by adjusted R-squared
        """
        remaining = set(data.columns)
        remaining.remove(response)
        selected = []
        current_score, best_new_score = 0.0, 0.0
        while remaining and current_score == best_new_score:
            scores_with_candidates = []
            for candidate in remaining:
                formula = "{} ~ {}".format(response,
                                               ' + '.join(selected + [candidate]))
                score = smf.ols(formula, data).fit().rsquared_adj
                scores_with_candidates.append((score, candidate))
            scores_with_candidates.sort()
            best_new_score, best_candidate = scores_with_candidates.pop()
            if current_score < best_new_score:
                remaining.remove(best_candidate)
                selected.append(best_candidate)
                current_score = best_new_score
        formula = "{} ~ {}".format(response,
                                       ' + '.join(selected))
        model = smf.ols(formula, data).fit()
        return model
    

    We used forward selection in this cause because we did not want to over fit out model by using backward selection.

    In [10]:
    # Run forward selection on our data
    model = forward_selected(avg_table1, 'Success')
    
    print model.model.formula
    
    Success ~ Average_Losses + Average_Shutouts + Average_Runs + Average_Fielding_Perc + intercept
    

    From the forward selection function that we ran, we found that success is determined by average losses, average shutouts, average runs, and the average fielding percentage. This is intersting because it includes one variable from each of our categories, offense (number of runs), defense (feilding percentage) and pitching (number of shutouts), as well as a record variable (number of losses).

    One of the issues that we ran into was perfect seperation. Here is an image that explains what perfect seperation is.

    In [11]:
    Image(filename = "cblrK.png")
    
    Out[11]:

    Some possible reasons why we may have ran into this issue are:

    1. One possible reason you run into prefect separation error is that you keep too many x variables. If you use many x variables, the combined value of x for the rows with y = 0 may gather in one area, and those with y = 1 may gather in another area, then prefect separation happens.

    2. Another possibility is that one of the x variables causes the same problem as shown in the plot.

    3. Maybe (1) and (2) together cause prefect separation, that is, you could split the data into two groups based on a subset of x variables.

    We decided to use another function/package which would ignore perfect separation. It seems that if we use .fit(method = "bfgs"), perfect separation would be ignored. We believed that perfect separation is not a big deal because we could still do the prediction.

    We chose to use a logistic regression model because our dependent variable "Success" is a binomial categorical variable, where there are only two possible outcomes, Y = 0 (failure) and Y = 1 (success).

    In [12]:
    # create list of predictor variables based on forward selection results
    predictor_vars = avg_table1[['Average_Shutouts', 'Average_Fielding_Perc', 'Average_Losses', 'Average_Runs', 'intercept']]
    
    # regress variables on Success using logistic regression
    logit = sm.Logit(avg_table1['Success'], predictor_vars)
    
    # fit the model
    result = logit.fit(method = "bfgs")
    print result.summary()
    
    Optimization terminated successfully.
             Current function value: 0.000000
             Iterations: 30
             Function evaluations: 48
             Gradient evaluations: 37
                               Logit Regression Results                           
    ==============================================================================
    Dep. Variable:                Success   No. Observations:                   30
    Model:                          Logit   Df Residuals:                       25
    Method:                           MLE   Df Model:                            4
    Date:                Tue, 21 Mar 2017   Pseudo R-squ.:                   1.000
    Time:                        16:18:36   Log-Likelihood:            -3.9703e-07
    converged:                       True   LL-Null:                       -20.728
                                            LLR p-value:                 2.163e-08
    =========================================================================================
                                coef    std err          z      P>|z|      [95.0% Conf. Int.]
    -----------------------------------------------------------------------------------------
    Average_Shutouts         23.5495   4.67e+06   5.04e-06      1.000     -9.16e+06  9.16e+06
    Average_Fielding_Perc     0.1944   2.15e+09   9.03e-11      1.000     -4.22e+09  4.22e+09
    Average_Losses           -6.3831   9.12e+05     -7e-06      1.000     -1.79e+06  1.79e+06
    Average_Runs              0.3780    6.5e+04   5.82e-06      1.000     -1.27e+05  1.27e+05
    intercept                 0.1901   2.05e+09   9.29e-11      1.000     -4.01e+09  4.01e+09
    =========================================================================================
    
    Complete Separation: The results show that there iscomplete separation.
    In this case the Maximum Likelihood Estimator does not exist and the parameters
    are not identified.
    

    Our null hypothesis in this case was $H_{0}$: our model fits well vs $H_{A}$: our model does not fit well

    The log-likelihood ratio test statistic p-value is 5.235e-09. This means that we would fail to reject the null hypthosis. This means that our model, that includes the average losses, average shutouts, average runs, and the average fielding percentage, fits well.

    Our model is: $log_{\text{IT}} = 0.1901 + 23.5495X^{\text{shutouts}}_{1} + 0.1944X^{\text{fielding}}_{2} + (-6.3831)X^{\text{losses}}_{3} + 0.3780X^{\text{runs}}_{4}$

    Conclusion

    From our analysis, we can conclude that a team's success, where their win-loss ratio is greater than one, is determined by the team's average losses, average shutouts, average runs, and their average fielding percentage. From our logit model, $log_{\text{IT}} = 0.1901 + 23.5495X^{\text{shutouts}}_{1} + 0.1944X^{\text{fielding}}_{2} + (-6.3831)X^{\text{losses}}_{3} + 0.3780X^{\text{runs}}_{4}$, we can see that the average number of shutouts plays a significant role in determining if a team was successful or not because it has a large coefficient. This makes sense because this variable, while primarily a pitching variable, also can be combined with defense because you need both a strong bullpen and a strong defense in order to have many shutouts.

    This brings us back to our initial analysis of the World Series winners. They did not necessarily have the best offense, but now we know that offense alone is not the key to being successful at baseball.

    We recongize that this model may not be perfect because a teams win-loss ratio can fluctuate between each season. For example the Kansas City Royals won the World Series in 2015; however, their win-loss ratio would not be defined as "successful". This may be due to the fact that players get traded between teams and teams may switch up their strategies. Overall, we believe that our model does a good job of predicting a successful team.

    Resources

    Contribution

    Janice Luong: Github master and modeling

    Sierra Tevlin: Initial analysis and written explanations

    Worked together on SQL and plotting (coded together)

    In [ ]: