Published: May 23, 2018
# Load these packages
library(tidyverse)
library(alabama)
library(knitr) # for the kable() tables below
In Wayne Winston’s book Mathletics he devotes a chapter to “Rating Sports Teams.” The technique is also covered in a couple of his other wonderful Excel books, and (less thoroughly) in an older article available on the Office site, titled “Using Solver to Rate Sports Teams”.
Let’s implement a very simple version of this in R. I’m not an optimization expert. After digging around the CRAN Task View for optimization I settled on the alabama package, mainly because it’s simple to use, handles nonlinear problems, and allows for different types of constraints.
Read the article to get an idea of the procedure we’re going to use to come up with our ratings.
Here’s a summary:
Home Edge + Home Team Rating - Away Team Rating
Home Edge is the home field advantage. The standard NFL home field advantage is 3 points, but we’ll let the model determine the optimal value. For example, a home team rated +5 hosting a team rated -2, with a 3-point edge, would be favored by 3 + 5 - (-2) = 10 points. That’s the basic idea. We’ll fill in more detail when it’s required.
Let’s code it up.
We’re going to use NFL scores from this season. NFL scores are some of the easiest sporting data to find. We’re going to use the data provided by Nate Silver’s FiveThirtyEight. They make this (and past) season’s scores available for their Elo project on GitHub.
nfl_scores_url <- 'https://projects.fivethirtyeight.com/nfl-api/2017/nfl_games_2017.csv'
nfl <- read_csv(nfl_scores_url)
glimpse(nfl)
## Observations: 267
## Variables: 12
## $ date <date> 2017-09-07, 2017-09-10, 2017-09-10, 2017-09-10, 201...
## $ season <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017...
## $ neutral <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ playoff <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ team1 <chr> "NE", "DET", "CLE", "CHI", "CIN", "BUF", "TEN", "HOU...
## $ team2 <chr> "KC", "ARI", "PIT", "ATL", "BAL", "NYJ", "OAK", "JAX...
## $ elo1 <dbl> 1687.395, 1500.732, 1335.768, 1383.935, 1515.970, 14...
## $ elo2 <dbl> 1613.149, 1536.822, 1598.853, 1617.205, 1491.100, 14...
## $ elo_prob1 <dbl> 0.6903093, 0.5415098, 0.2422707, 0.2751518, 0.626524...
## $ score1 <int> 27, 35, 18, 17, 0, 21, 16, 7, 17, 46, 17, 3, 19, 29,...
## $ score2 <int> 42, 23, 21, 23, 20, 12, 26, 29, 30, 9, 9, 23, 3, 19,...
## $ result1 <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0...
There’s some housekeeping necessary before we start coding the model.
Games on neutral fields should alter the forecast formula, since there’s no home field advantage.
We’re keeping the following columns:

- date
- neutral
- team1 and team2, renamed to home and away
- score1 and score2, renamed to h_score and a_score
The data doesn’t specify which team is home, but a quick look at week one’s scores shows that team1 and score1 are the home team and its score. Changing the column names isn’t necessary, but you’ll be glad you did next season if you come back to the script. It’s one less thing to remember or look up.
Add a margin of victory column for the home team. We’re going to need this later.
games <- nfl %>%
  filter(!is.na(score1)) %>% # remove future (unplayed) games
  select(date,
         neutral,
         home = team1,
         away = team2,
         h_score = score1,
         a_score = score2) %>%
  mutate(h_mov = h_score - a_score) # home margin of victory
Let’s look at those neutral games.
paste("There was",
nrow(games),
"games played in 2017.",
sum(games$neutral),
"of those games were played on a neutral field.")
## [1] "There was 267 games played in 2017. 6 of those games were played on a neutral field."
Six games shouldn’t have a huge impact; that’s a little over two percent of the schedule. We’ll test that theory. First we’ll ignore neutral games.
Create the basic forecast function. Remember, the formula is Home Edge + Home Team Rating - Away Team Rating.
home_forecast <- function(home_rating,
visitor_rating,
home_edge) {
(home_edge + home_rating) - visitor_rating
}
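A quick sanity check with the made-up numbers from earlier: a home team rated +5 hosting a team rated -2, with a 3-point home edge, should be favored by 10.
home_forecast(home_rating = 5, visitor_rating = -2, home_edge = 3)
## [1] 10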
Create a vector consisting of every team in the NFL.
teams <- sort(unique(games$home))
length(teams) == 32 # there's 32 NFL teams
## [1] TRUE
To start, each team will be given a rating of zero. We’re also going to add an extra zero to represent the home field “edge,” or advantage. Finally, add the team names to the ratings vector. This is for matching and updating ratings (discussed later).
ratings <- rep(0,33)
names(ratings) <- c(teams, 'hfa') # hfa = home field advantage
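The names are what make the rating lookups work: indexing the vector with a team abbreviation returns that team’s rating. Every rating is zero right now, of course.
ratings[c('NE', 'PHI')] # look up ratings by team abbreviation
##  NE PHI
##   0   0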
constrOptim.nl is the function from the alabama package that will perform the optimization. The function’s argument fn is, per the documentation, a
Nonlinear objective function that is to be optimized. A scalar function that takes a real vector as argument and returns a scalar that is the value of the function at that point.
In other words, we need to create a function that sums the squared
errors between the forecast and actual game scores. constrOptim.nl
minimizes this number by trying out different team ratings. When a
minimum is reached the algorithm will stop and we can access the team
ratings that best explain the results so far this season.
The function is simple. Pass in the team ratings as the argument. The forecast for each game is calculated via the home_forecast function. Each team’s rating is looked up by passing the home and away columns to the ratings vector, matching the ratings names to the values in the columns. The home field edge is matched up with the 33rd element in the ratings vector. Next, an error column named resid subtracts the forecast from the home team’s margin of victory and squares the result. All of the squared errors are summed up in the summarise function in the column sse, and that scalar value is returned. This is the number constrOptim.nl will attempt to minimize.
sqr_err <- function(ratings) {
games %>%
mutate(forecast = home_forecast(ratings[home],
ratings[away],
ratings[33])) %>% # 33 is the hfa
mutate(resid = (h_mov - forecast) ^ 2) %>%
summarise(sse = sum(resid)) %>%
pull(sse)
}
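Before optimizing, it’s worth calling the function once with the all-zero starting ratings. Since every forecast is zero at this point, the result is just the sum of the squared home margins of victory, a baseline the optimizer should improve on.
sqr_err(ratings) # with all-zero ratings this equals sum(games$h_mov ^ 2)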
Winston’s article suggests setting up a constraint in Excel’s Solver.
It is convenient to make our average team rating equal to 0. A team with a positive rating is better than average and a team with a negative rating is worse than average.
alabama’s way of doing this feels a little awkward to me, but it works.
We need the mean of the team ratings vector to equal zero when the
optimization is complete. Remember that the ratings
vector includes
the hfa. That needs to be removed from the calculation.
alabama makes us place this constraint in a function. Because our constraint is that the team ratings must average to zero, the function must be passed via the heq argument:
a vector function specifying equality constraints such that heq[j] = 0 for all j
The structure of this function is taken directly from the alabama
documentation.
set_team_avg_zero <- function(ratings){
h <- rep(NA, 1)
h[1] <- mean(ratings[-33])
h
}
set_team_avg_zero(ratings)
## [1] 0
It’s not too tough to parse out what’s going on. Inside the function, h is a vector of constraints. Assign the equality, i.e. the value on the right-hand side of the assignment arrow must equal zero at the solution. Then return the vector of constraints.
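To see the constraint do its job, nudge one team’s rating and check that the function no longer returns zero (a throwaway example, the numbers are invented):
tmp <- ratings
tmp['NE'] <- 3.2       # pretend NE is 3.2 points above average
set_team_avg_zero(tmp) # mean of the 32 team ratings is now nonzero
## [1] 0.1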
From here it’s easy. We need to pass constrOptim.nl three arguments:

- par: the starting vector of parameter values. The ratings vector.
- fn: the function that is to be optimized. The sqr_err function.
- heq: a vector function specifying equality constraints. The set_team_avg_zero function.

There’s a lot of message output with constrOptim.nl. To save space I’ll suppress it here; there’s nothing we can’t get out ourselves later. Note: when you run this on your own computer it will take a minute.
mod <- constrOptim.nl(par = ratings,
fn = sqr_err,
heq = set_team_avg_zero)
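Because the messages are suppressed, it’s worth confirming the optimizer actually converged. Per the alabama documentation, constrOptim.nl returns a convergence code where 0 indicates success.
mod$convergence # 0 indicates successful convergence
mod$value       # the minimized sum of squared errors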
mod is a list of six items. The item we’re interested in is $par.
(sse_ratings <- mod$par)
## ARI ATL BAL BUF CAR CHI
## -3.729083 5.543891 3.137838 -3.602903 4.610258 -1.387323
## CIN CLE DAL DEN DET GB
## -5.223082 -11.214136 1.890716 -6.553532 2.527967 -2.088456
## HOU IND JAX KC LAC LAR
## -6.641290 -10.254650 6.322584 3.026852 3.812193 8.064554
## MIA MIN NE NO NYG NYJ
## -6.114150 7.684512 8.859648 8.901039 -7.337259 -4.812183
## OAK PHI PIT SEA SF TB
## -4.545327 11.308393 4.522279 1.943337 -2.846817 -1.180602
## TEN WSH hfa
## -3.467773 -1.157492 2.422892
These are our team ratings. According to the model, this season’s home field advantage is 2.42 points. If we want the rankings we sort the vector in descending order.
Here are the top 10 teams in the NFL by our rating system.
tibble(teams = names(sse_ratings),
rankings = sse_ratings) %>%
arrange(desc(rankings)) %>%
head(10) %>%
kable()
teams | rankings |
---|---|
PHI | 11.308393 |
NO | 8.901039 |
NE | 8.859647 |
LAR | 8.064554 |
MIN | 7.684512 |
JAX | 6.322584 |
ATL | 5.543891 |
CAR | 4.610258 |
PIT | 4.522279 |
LAC | 3.812193 |
There are a couple of surprises. New Orleans is ranked second and the Jaguars are lower than I’d expect, but it’s a standard-looking top ten.
Remember, we’re ignoring the fact that six of the games were played on a neutral field. Does accounting for neutral games improve the model?
paste("As of now the SSE is", round(mod$value,3))
## [1] "As of now the SSE is 37908.189"
To factor in neutral games we need to alter the forecast function.
home_forecast_neutral <- function(home_rating,
visitor_rating,
home_edge,
is_neutral) {
ifelse(is_neutral,
home_rating - visitor_rating,
(home_edge + home_rating) - visitor_rating)
}
The code above checks if a game is neutral. If it is, the home edge is removed from the formula; otherwise it’s the same as before.
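A quick check of both branches with the same invented ratings as before shows the edge disappearing on a neutral field:
home_forecast_neutral(5, -2, 3, is_neutral = 0) # home game: 3 + 5 - (-2)
## [1] 10
home_forecast_neutral(5, -2, 3, is_neutral = 1) # neutral field: 5 - (-2)
## [1] 7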
We also need to change the loss function.
sqr_err_neutral <- function(ratings) {
games %>%
mutate(forecast = home_forecast_neutral(ratings[home],
ratings[away],
ratings[33],
neutral)) %>% # 33 is the hfa
mutate(resid = (h_mov - forecast) ^ 2) %>%
summarise(sse = sum(resid)) %>%
pull(sse)
}
Let’s create a new ratings vector.
neutral_ratings <- rep(0,33)
names(neutral_ratings) <- c(teams, 'hfa') # hfa = home field advantage
Now we can re-run the model.
mod_neutral <- constrOptim.nl(par = neutral_ratings,
fn = sqr_err_neutral,
heq = set_team_avg_zero)
The difference between the two models isn’t worth the extra effort IMO.
tibble(with_neutral_mod = mod_neutral$value,
       without_neutral_mod = mod$value) %>%
  kable()
with neutral mod | without neutral mod |
---|---|
38036.65 | 37908.19 |
I’ll leave it up to you to see if there are substantial differences in the rankings.
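One way you might start, a quick sketch that lines the two sets of ratings up side by side and sorts by the biggest changes (this assumes constrOptim.nl preserves the names on par, as the printed output above suggests):
tibble(team = names(mod$par),
       without_neutral = mod$par,
       with_neutral = mod_neutral$par) %>%
  mutate(rating_diff = without_neutral - with_neutral) %>%
  arrange(desc(abs(rating_diff)))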
While it’s interesting to compare results to other rating systems, we’re really interested in the predictions the model makes.
To do this we’ll create a function that predicts the home team’s margin
of victory by sticking the home_forecast
function inside another
function.
get_pointspred <- function(ratings, home, away){
home_forecast(ratings[home],
ratings[away],
ratings['hfa']) %>%
round(2)
}
Below is an example of how you could use the ratings to predict point spreads. Obviously you want to predict games in the future, but it works as an illustration.
nfl %>%
filter(date == '2018-02-04') %>%
select(date, home = team1, away = team2) %>%
mutate(spread = get_pointspred(mod$par, home, away)) %>%
kable()
date | home | away | spread |
---|---|---|---|
2018-02-04 | NE | PHI | -0.03 |
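The same function works for any matchup you care to dream up. For example, a hypothetical Green Bay home game against Chicago (teams chosen arbitrarily):
get_pointspred(mod$par, home = 'GB', away = 'CHI') # about 1.7 points in GB's favor, given the ratings above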
This shouldn't need to be said, but you can never be too safe: do not make bets using this method alone. If you do, it's at your own risk.