Pythagorean Expectation Across Various Sports

Overview of Pythagorean Expectation

Pythagorean expectation is a formula that originated with Bill James in the early 1980s. The formula estimates the number of games a baseball team should have won based on the number of runs it scored and allowed. The basic formula is: \[\text{Win Percentage} = \frac{\text{Runs Scored}^2}{\text{Runs Scored}^2 + \text{Runs Allowed}^2} \] The Pythagorean winning percentage is often compared with a team’s actual winning percentage to identify teams that are underperforming or overperforming and to judge whether a team’s performance is sustainable over the course of the season. It has become popular enough that a team’s expected record according to this formula is displayed in the standings on MLB.com under the X-W/L column.

[Figure: MLB standings showing the Pythagorean expected record (X-W/L) column]

Considering the AL East in 2022, the Yankees’ record was worse than expected based on their run differential, while the Orioles’ record was better than expected. In addition to determining whether a team is under- or overperforming in a season, the Pythagorean record can help inform a team how good it actually is when making decisions during free agency in the following offseason.
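
As a quick illustration of the calculation (with made-up run totals rather than the actual 2022 figures), the expected record over a 162-game season falls straight out of the formula:

# Hypothetical season totals, not the actual 2022 numbers
runs_scored  <- 800
runs_allowed <- 700
games        <- 162

# Pythagorean expectation with Bill James' original exponent of 2
exp_wpct <- runs_scored^2 / (runs_scored^2 + runs_allowed^2)
exp_wpct                  # roughly 0.566
round(exp_wpct * games)   # about 92 expected wins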

Bill James initially proposed an exponent of 2 in this formula, and there have been a number of studies looking for an exponent that provides a better fit. In baseball, the most accurate exponent has been found to be 1.83, and it is used most often.1 One simple way to derive this exponent is to transform the Pythagorean formula and fit a linear regression. This is a simplified approach, but even linear regression produces exponent values that are close to the ideal values found for each sport.

The formula for Pythagorean win percentage with an unknown exponent \(x\) can be written as: \[\text{Win Percentage} = \frac{W}{W + L} = \frac{R^x}{R^x + RA^x} \] where \(R\) is the runs scored by a team and \(RA\) is the runs allowed by a team.

Using algebra, the formula can be rearranged as shown below: cross-multiplying, cancelling the common term, and taking logarithms leaves \(x\) as the coefficient of a linear regression model.

\[W R^x + W\,RA^x = W R^x + L R^x\] \[W\,RA^x = L R^x\]

\[\frac{W}{L} = \frac{R^x}{RA^x}\] \[\log\left(\frac{W}{L}\right) = x \log\left(\frac{R}{RA}\right)\] With this formula, we can solve for \(x\) using simple linear regression, where \(\log(\frac{W}{L})\) is the response variable and \(\log(\frac{R}{RA})\) is the explanatory variable.
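
As a quick sanity check on this rearrangement, the sketch below simulates seasons from a known exponent and shows that the no-intercept regression recovers a value close to it (the run totals and the exponent of 1.83 are made up for the simulation):

# Simulate seasons from a known exponent and try to recover it
set.seed(42)
true_x <- 1.83
runs_scored  <- round(runif(100, 600, 900))
runs_allowed <- round(runif(100, 600, 900))

# Wins implied by the Pythagorean formula over a 162-game season
wpct   <- runs_scored^true_x / (runs_scored^true_x + runs_allowed^true_x)
wins   <- round(162 * wpct)
losses <- 162 - wins

sim <- data.frame(log_w_ratio = log(wins / losses),
                  log_r_ratio = log(runs_scored / runs_allowed))

# The no-intercept coefficient should come back close to 1.83
coef(lm(log_w_ratio ~ -1 + log_r_ratio, data = sim))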

The value of this exponent varies across sports, and the remainder of this post shows how to estimate it for MLB, NHL, NFL, and NBA using code in R. A discussion of what the different values of the exponent mean won’t be covered here, but a brief yet thorough explanation of the exponent is given in the theoretical explanation section of this Wikipedia page.

Pythagorean Exponent For Different Sports

MLB

Since the Pythagorean winning percentage is used most commonly in baseball, I wanted to look at baseball first. As previously mentioned, the ideal exponent has been found to be 1.83, so I will compare the result I get here to that value.

The data to calculate the exponent comes from the Lahman package, which contains baseball statistics from 1871 through the present. Since some statistics were not recorded in the early years, only seasons after 1900 were used. The run differential and winning percentage were calculated for each team in every season, and linear regression was then used to estimate the value of the exponent. Note that the rearranged formula has no intercept term, which is why -1 is included in the call to the lm() function.

The resulting value is 1.854, which is close to the ideal value of 1.83, even though it was derived using only a simple linear regression.

library(tidyverse)
library(Lahman)
# Team totals for every season after 1900
mlb <- Teams %>% 
    filter(yearID > 1900) %>%
    select(teamID, yearID, lgID, G, W, L, R, RA) %>%
    mutate(RD = R - RA, 
           Wpct = W / (W + L))

# Log ratios used in the transformed Pythagorean formula
mlb_pyt <- mlb %>%
    mutate(log_w_ratio = log(W/L),
           log_r_ratio = log(R / RA))
# No-intercept regression: the coefficient is the Pythagorean exponent
pyt_fit <- lm(log_w_ratio ~ -1 + log_r_ratio, data = mlb_pyt)
round(coef(pyt_fit), digits = 3)
## log_r_ratio 
##       1.854
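
As a quick follow-up sketch using the objects created above, the fitted exponent can be plugged back into the formula to flag the seasons where a team's actual record differed most from its expected record:

# (Sketch) Expected winning percentage from the fitted exponent and the gap
# between the actual and expected record (positive = overperforming)
pyt_exp <- coef(pyt_fit)[["log_r_ratio"]]

mlb_pyt %>%
    mutate(exp_Wpct = R^pyt_exp / (R^pyt_exp + RA^pyt_exp),
           diff = Wpct - exp_Wpct) %>%
    select(teamID, yearID, W, L, Wpct, exp_Wpct, diff) %>%
    arrange(desc(abs(diff))) %>%
    head()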

NHL

For NHL data, I used the nhlapi package, which uses the open NHL API to get data. The standings for all seasons from 1990 through 2022 were pulled, and the point differential for each team was calculated from the goals scored and goals against.

Linear regression is used again in the same way that it was for the MLB data, and the resulting value for the exponent is 2.35. A study of Pythagorean winning percentage for the NHL by Kevin D. Dayaratna and Steven J. Miller2 found an ideal exponent of just above 2.

library(nhlapi)

nhl <- nhl_standings(seasons = 1990:2022)
# Combine each season's team records and compute the log ratios
nhl_data <- nhl$teamRecords %>%
    bind_rows() %>%
    mutate(point_diff = goalsScored - goalsAgainst,
           Win_pct = leagueRecord.wins / (leagueRecord.wins + leagueRecord.losses),
           log_win_ratio = log(leagueRecord.wins/leagueRecord.losses),
           log_point_ratio = log(goalsScored/goalsAgainst))
# No-intercept regression, as with the MLB data
pyt_fit_nhl <- lm(log_win_ratio ~ -1 + log_point_ratio, data = nhl_data)
coef(pyt_fit_nhl)
## log_point_ratio 
##        2.354845
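
To get a feel for what a larger exponent means, the sketch below compares the expected winning percentage under the original exponent of 2 with the fitted value for a hypothetical team that scores 10% more goals than it allows:

# Hypothetical goal totals, chosen only to illustrate the exponent's effect
goals_for     <- 275
goals_against <- 250

# Helper used only for this sketch
pyt_wpct <- function(gf, ga, x) gf^x / (gf^x + ga^x)

pyt_wpct(goals_for, goals_against, 2)                                      # about 0.548
pyt_wpct(goals_for, goals_against, coef(pyt_fit_nhl)[["log_point_ratio"]]) # about 0.556

The larger exponent pushes the expected record further from .500 for the same goal ratio.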

NFL

The NFL data was scraped using the espnscrapeR package, which can be used to pull the NFL standings for each year. The earliest year that provided the points scored and allowed for each team was 2002, so the data for the model spans 2002-2021.

One adjustment I had to make for the NFL data was using log1p() rather than log() for the wins and losses of each team. Since there have been a few instances of a team winning or losing zero games, the natural logarithm would have been undefined. log1p() calculates \(\log(1 + x)\), which keeps those zero values finite. There may be other ways to handle those specific seasons, but log1p() was a quick way to cover them. I also used map_df() to get the standings for each season and return a data frame without using a for loop; the map functions in the purrr package are very useful in a variety of scenarios.
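
As a quick illustration of how that adjustment behaves:

log(0)     # -Inf: the log win-loss ratio blows up for a winless (or lossless) team
log1p(0)   # 0: log(1 + 0) stays finite
log1p(16)  # 2.833: equivalent to log(17) for a 16-win season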

The output from the linear regression model gives an exponent of 2.396. Football Outsiders has previously used an exponent of 2.37, which they credited to Daryl Morey. They have also suggested a new exponent that adjusts for offensive environment.3

library(espnscrapeR)
library(purrr)

nfl_data <- map_df(2002:2021, get_nfl_standings)

# log1p() keeps the log win-loss ratio defined for 0-win or 0-loss seasons
nfl_pyt <- nfl_data %>%
    mutate(log_w_ratio = log1p(wins) - log1p(as.numeric(losses)),
           log_point_ratio = log(pts_for/pts_against))

pyt_fit_nfl <- lm(log_w_ratio ~ -1 + log_point_ratio, data = nfl_pyt)
coef(pyt_fit_nfl)
## log_point_ratio 
##        2.396146

NBA

The NBA data was scraped using the hoopR package, which also uses ESPN to get the yearly standings for teams. Data was taken from the 1994-1995 season through the 2021-2022 season. The average points scored and allowed by each team were given in the standings, so the season totals were calculated by multiplying the averages by the number of games played (generally 82, but a few lockout seasons and the COVID-19 season were shortened).

The exponent given by linear regression here is 14.085. Daryl Morey is credited with deriving the exponent for basketball, and he arrived at an exponent of 13.91.

library(hoopR)
nba_data <- map_df(1994:2021, espn_nba_standings)
nba_data <- nba_data %>%
    mutate(wins = as.numeric(wins),
           losses = as.numeric(losses),
           avgpointsfor = as.numeric(avgpointsfor),
           avgpointsagainst = as.numeric(avgpointsagainst),
           # Convert per-game averages into season totals
           games = wins + losses,
           points_for = games*avgpointsfor,
           points_against = games*avgpointsagainst,
           point_diff = points_for - points_against,
           log_w_ratio = log(wins/losses),
           log_point_ratio = log(points_for/points_against))
pyt_fit_nba <- lm(log_w_ratio ~ -1 + log_point_ratio, data = nba_data)
coef(pyt_fit_nba)
## log_point_ratio 
##        14.08544

Summary

As the table below shows, the exponents derived from a simple no-intercept linear regression of the log win-loss ratio on the log ratio of points (or runs, or goals) scored to allowed are pretty close to the ideal exponents that have been derived by others. Although the ideal exponents were found with more rigorous statistical methods, a simple linear regression model still provides estimates of the Pythagorean exponent that are in the general ballpark of these values.

League   Regression Exponent   Ideal Exponent
MLB            1.854               1.83
NHL            2.355               Above 2
NFL            2.396               2.37
NBA           14.085               13.91
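
For reference, a comparison table like the one above can be assembled directly from the four fitted model objects (assuming they are still in the environment from the earlier sections):

# Collect the fitted exponents from each league's model into one table
tibble(
    League   = c("MLB", "NHL", "NFL", "NBA"),
    Exponent = round(unname(c(coef(pyt_fit), coef(pyt_fit_nhl),
                              coef(pyt_fit_nfl), coef(pyt_fit_nba))), 3),
    Ideal    = c("1.83", "Above 2", "2.37", "13.91")
)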

  1. Baseball Reference uses an exponent of 1.83 in their calculation of Pythagorean winning percentage.↩︎

  2. A published paper shows the methodology they used to get this estimate.↩︎

  3. https://www.footballoutsiders.com/stat-analysis/2017/presenting-adjusted-pythagorean-theorem↩︎

David Teuscher
Quantitative Modeling Analyst