Published: May 18, 2018
We are going to come up with a number for home field advantage (pretend we don't already know that 3 points is a pretty solid estimate).
Since we only need team, score, and location, we could use just about any dataset. For NFL data it's hard to beat ArmchairAnalysis for quality and price.
The GAME.csv file contains all the information we're looking for (which isn't much more than team scores).

A quick overview of the code below: file.path creates a path to a file or directory in a platform-independent way. ~ on a Mac is shorthand for your home directory. dir lists all the files in the directory passed to it. We have it return only files with 'GAME' in the name (one in this case) and return the full path to the file.

library(tidyverse)
file_path <-
  file.path("~",
            "Dropbox (Personal)",
            "data_sets",
            "ArmchairAnalysis",
            "nfl_00-17")
game_file <- dir(file_path, full.names = TRUE, pattern = 'GAME')
nfl <- read_csv(game_file)
nfl %>% glimpse()
## Observations: 4,790
## Variables: 17
## $ gid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ seas <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 200...
## $ wk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, ...
## $ day <chr> "SUN", "SUN", "SUN", "SUN", "SUN", "SUN", "SUN", "SUN", "...
## $ v <chr> "SF", "JAC", "PHI", "NYJ", "IND", "SEA", "CHI", "TB", "DE...
## $ h <chr> "ATL", "CLE", "DAL", "GB", "KC", "MIA", "MIN", "NE", "NO"...
## $ stad <chr> "Georgia Dome", "Cleveland Browns Stadium", "Texas Stadiu...
## $ temp <int> 79, 78, 109, 77, 90, 89, 65, 71, 89, 80, 64, 74, 80, 73, ...
## $ humd <int> NA, 63, 19, 66, 50, 59, NA, 93, NA, 79, 49, 84, 98, 78, N...
## $ wspd <int> NA, 9, 5, 5, 8, 13, NA, 5, NA, 3, 10, 8, NA, 10, NA, 5, 8...
## $ wdir <chr> NA, "NE", "S", "E", "E", "E", NA, "VAR", NA, "VAR", "W", ...
## $ cond <chr> "Dome", "Sunny", "Sunny", "Mostly Cloudy", "Mostly Sunny"...
## $ surf <chr> "AstroTurf", "Grass", "AstroTurf", "Grass", "Grass", "Gra...
## $ ou <dbl> 42.5, 38.0, 40.0, 36.0, 44.0, 36.0, 47.0, 35.5, 39.5, 40....
## $ sprv <dbl> 7.0, -10.0, 6.0, 2.5, -3.0, 3.0, 4.5, -3.0, 1.0, 7.0, 7.0...
## $ ptsv <int> 28, 27, 41, 20, 27, 0, 27, 21, 14, 16, 6, 16, 17, 13, 36,...
## $ ptsh <int> 36, 7, 14, 16, 14, 23, 30, 16, 10, 21, 9, 0, 20, 16, 41, ...
Now that we have the NFL data loaded and have taken a glimpse of it, we can begin to calculate HFA for different teams and time frames. The nfl dataset runs from 2000 through 2017. The NFL currently has 32 teams, and it's been that way since 2002. Because of that, we will filter out the first two seasons of the nfl dataset so we can compare "apples to apples."
nfl <- filter(nfl, seas >= 2002)
Also, the Rams decided to move to LA. This really screws with our records. Normally I'd change any "STL" record to "LA" because we're still dealing with the same team, but in this case, where the focus is location, I've decided to leave it as is.
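For anyone who would make the other choice, here's a minimal sketch of the recode (nfl_relabeled is just a scratch name; we don't run this here):

nfl_relabeled <- nfl %>%
  mutate(v = if_else(v == "STL", "LA", v),   # relabel the visiting Rams
         h = if_else(h == "STL", "LA", h))   # relabel the home Rams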
We need to make a few more choices before we begin. With rare exceptions, NFL teams don't play on a neutral field, meaning one of the two teams is at home. The biggest exception to this rule is the Super Bowl. Therefore, Super Bowl results need to be filtered out of the dataframe.
The nfl dataframe contains a column wk, which stands for week. Looking at the unique values, we can isolate the week of the Super Bowl (in theory it should be the last week). It's also important to see whether the dataset contains pre-season games; if it does, those games should be filtered out. Our nfl data doesn't have pre-season games.
unique(nfl$wk)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
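Another quick check (output omitted): count the games in each week. Playoff weeks have far fewer games than regular season weeks, so the drop-off makes them easy to spot.

nfl %>% count(wk)   # weeks 18-21 have far fewer games (playoffs)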
Week 21 should be the Super Bowl each season, so if the dataframe is filtered on wk 21 we should see the Super Bowl matchups for the past 16 years. My memory isn't great, so we will arrange the dataframe so that the most recent games are on top.
nfl %>%
  filter(wk == 21) %>%
  arrange(desc(seas)) %>%
  select(seas, wk, v, h)
## # A tibble: 16 x 4
## seas wk v h
## <int> <int> <chr> <chr>
## 1 2017 21 PHI NE
## 2 2016 21 NE ATL
## 3 2015 21 CAR DEN
## 4 2014 21 NE SEA
## 5 2013 21 SEA DEN
## 6 2012 21 BAL SF
## 7 2011 21 NYG NE
## 8 2010 21 PIT GB
## 9 2009 21 NO IND
## 10 2008 21 PIT ARI
## 11 2007 21 NYG NE
## 12 2006 21 IND CHI
## 13 2005 21 SEA PIT
## 14 2004 21 NE PHI
## 15 2003 21 CAR NE
## 16 2002 21 OAK TB
Looking at the list confirms that wk 21 is the Super Bowl. Going forward, that week will be filtered out of all calculations.
nfl <- filter(nfl, wk != 21)
The next choice is whether or not to include playoff games. A case can be made for either side of this issue but this isn't the place for that discussion. We will include playoff games in the calculations. If we decide to run the calculations again without playoff games we'd filter out everything above week 17.
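As a minimal sketch of that alternative (nfl_no_playoffs is a scratch name; we don't run this here):

nfl_no_playoffs <- filter(nfl, wk <= 17)   # regular season weeks only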
There's one last problem that needs to be dealt with if we're concerned with accurate numbers: international games. Since 2007 the NFL has played a number of regular season games outside the United States. Most of these games have been played at Wembley Stadium in London, but a few have been played at Twickenham in London or Estadio Azteca in Mexico City. None of these games can be considered a "home" game for either team, regardless of what color shirt they wear. The Bills also played a series of games in Toronto from 2008 to 2012; we'll remove those too. The dataset has a stad column, so we can filter out the games played in these stadiums.
nfl <- nfl %>%
  filter(!stad %in% c("Azteca Stadium",
                      "Wembley Stadium",
                      "Rogers Centre",
                      "Twickenham, London"))
Now, there's probably a game or two played on a neutral field that we aren't catching: a game got rescheduled, a natural disaster, etc. That might have a small impact when looking at HFA on a per-team basis, but in the overall picture it shouldn't matter much now that we have removed the 25 or so games played at a neutral site.
After taking out all of those games, what's the average HFA considering all teams since the 2002 season?
There are a few ways to calculate home field advantage. The most logical way is to add up all the home scores and all the visitor scores, subtract the visitor sum from the home sum, and divide that number by the total number of games.
(sum(nfl$ptsh) - sum(nfl$ptsv)) / dim(nfl)[1]
## [1] 2.637137
We can get the same answer by creating a margin of victory column for each game, which we will call h_mov, and then taking the mean of the h_mov column.
nfl %>%
  mutate(h_mov = ptsh - ptsv) %>%
  summarise(hfa = mean(h_mov)) %>%
  pull(hfa)
## [1] 2.637137
Over the last 16 seasons the average HFA is approximately 2.6 points. Has the NFL changed in the last few years? Does the HFA hold steady?
One thing we can do is to look at the HFA per season and see if anything has changed over time.
(hfa_by_yr <- nfl %>%
  mutate(h_mov = ptsh - ptsv) %>%
  group_by(seas) %>%
  summarise(hfa = round(mean(h_mov), 2)) %>%
  ungroup()) %>%
  knitr::kable()
seas | hfa |
---|---|
2002 | 2.49 |
2003 | 3.59 |
2004 | 2.67 |
2005 | 3.45 |
2006 | 1.05 |
2007 | 2.97 |
2008 | 2.42 |
2009 | 2.61 |
2010 | 1.66 |
2011 | 3.46 |
2012 | 2.75 |
2013 | 3.20 |
2014 | 2.85 |
2015 | 1.42 |
2016 | 3.03 |
2017 | 2.58 |
Sometimes it's hard to see trends just looking at a table of numbers.
hfa_by_yr %>%
  ggplot(., aes(as.factor(seas), hfa)) +
  geom_bar(stat = "identity",
           fill = "azure3",
           col = "white") +
  labs(x = "Season",
       y = "Home Field Advantage") +
  scale_x_discrete(breaks = seq(2002, 2017, 3)) +
  scale_y_continuous(breaks = seq(0, 4, .5)) +
  theme_minimal()
The HFA looks like it jumps around quite a bit, with a low of just over one point in 2006 and a high of about three and a half points in 2003. 2015 was a below-average year, with the HFA registering slightly below one and a half points, but 2016 jumped back above the global average. A case could be made that the last five seasons have seen a decline in the value of HFA, but it's still a little too early to tell.
Small Detour: Here's an example of how easy it is to create "trends" by cherry-picking start and end dates.
hfa_by_yr %>%
  filter(seas > 2010) %>%
  ggplot(aes(seas, hfa)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  theme_minimal()
Using the graph above you can make the case that HFA is becoming less of a factor in the NFL, while the truth is it's pretty stable.
While at first glance it might make sense to calculate home field advantage using the method above, it's not quite the correct way to do so. Getting to a solid HFA number isn't as easy as we made it look.
One question we need to ask is: what is home field advantage? At the most basic level, it is the advantage in win probability (which we convert into points) that a team gets while playing on their home field. This might sound simplistic, but it's called home field advantage because teams tend to perform better at home than on the road.
Imagine an idealized game where both teams are evenly matched. The expectation is that if they were to play 1,000 games on a neutral field, each team would win 500 times. What happens if we take the game from the neutral field to either team's home field? Based on past NFL performance, we would expect the home team to win by roughly three points. That's home field advantage. Since teams are rarely evenly matched, it's not as easy as saying the home team will win by three points. Ultimately, we are concerned with HFA's impact on betting, so let's look at how HFA affects the point spread.
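To make that thought experiment concrete, here's a toy simulation. The normal distribution and the ~13.5-point standard deviation for NFL margins are illustrative assumptions, not numbers taken from our dataset:

set.seed(42)
n_games <- 1000

# evenly matched teams on a neutral field: expected margin is zero
neutral_margin <- rnorm(n_games, mean = 0, sd = 13.5)

# the same matchup with the expected margin shifted by a ~3-point HFA
home_margin <- rnorm(n_games, mean = 3, sd = 13.5)

mean(neutral_margin > 0)  # roughly 0.5 -- each team wins about half
mean(home_margin > 0)     # noticeably above 0.5 once HFA is added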
The Dolphins are playing the Saints. On a neutral field we would make the Saints a four point favorite (again meaning that if the Saints and Dolphins played 1000 times on a neutral field we'd expect the Saints on average to win by four points).
Imagine two games: one where the Saints are at home and the other where the Dolphins are the home team.
home team = bottom team

Team | Spread | Team | Spread |
---|---|---|---|
Dolphins | +7 | Saints | -1 |
Saints | -7 | Dolphins | +1 |
The tables above show how home field impacts betting lines. The neutral field line is Saints -4. If we accept that the average HFA in the NFL is approximately 3 points, then when the Saints are home they should win by seven points. However, when the game is played in Miami the Dolphins should perform better, and the line is Saints minus one. That's a six point swing from one location to the other.
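The arithmetic behind that swing, sketched out (the 3-point HFA is the assumed league average from above):

neutral_line <- -4   # Saints -4 on a neutral field
hfa <- 3             # assumed league-average home field advantage

neutral_line - hfa   # Saints at home: Saints -7
neutral_line + hfa   # Saints at Miami: Saints -1 (Dolphins +1)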
Now we will look at another way to calculate home field advantage, and we'll do it by team.
Is HFA uniform across teams? Some teams play better at home than others. To determine which teams have the biggest HFA, we need to calculate some statistics.
What do we need to come up with the HFA per team?
Unfortunately for us, the dataset we're working with isn't in a suitable format to calculate those numbers. This means we're going to have to do some data wrangling and tidying before we can come up with an HFA per team.
Below is the code that generates HFA per team from 2002 through 2017. It doesn't include pre-season, international, or neutral site games (or at least we try not to include them).
First we create a small helper function to view the biggest and smallest home field advantage by team. Below the code and output we will go over each line and see how to both wrangle the data and calculate the HFA.
Before we do that, let's take a peek at the dataset.
nfl %>%
glimpse()
## Observations: 4,233
## Variables: 17
## $ gid <int> 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 53...
## $ seas <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 200...
## $ wk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...
## $ day <chr> "THU", "SUN", "SUN", "SUN", "SUN", "SUN", "SUN", "SUN", "...
## $ v <chr> "SF", "NYJ", "BAL", "MIN", "SD", "KC", "STL", "ATL", "IND...
## $ h <chr> "NYG", "BUF", "CAR", "CHI", "CIN", "CLE", "DEN", "GB", "J...
## $ stad <chr> "Giants Stadium", "Ralph Wilson Stadium", "Ericsson Stadi...
## $ temp <int> 73, 86, 78, 85, 90, 86, 83, 83, 85, 87, 70, 90, 83, 76, N...
## $ humd <int> 49, 75, 59, NA, 60, 48, 33, 64, NA, 69, 47, 57, 60, 71, N...
## $ wspd <int> 7, 6, 12, 3, 7, 3, 4, 9, 16, 12, 7, 6, NA, 5, NA, 10, NA,...
## $ wdir <chr> "N", "SW", "N", "NW", "N", "VAR", "E", "SW", "E", "E", "W...
## $ cond <chr> "Fair", "Mostly Sunny", "Partly Cloudy", "Partly Cloudy",...
## $ surf <chr> "Grass", "AstroTurf", "Grass", "AstroPlay", "Grass", "Gra...
## $ ou <dbl> 38.5, 40.5, 33.5, 40.5, 37.5, 36.0, 52.0, 42.5, 43.5, 35....
## $ sprv <dbl> -3.5, -3.0, -2.0, 4.5, 2.5, 2.0, -3.0, 7.0, -3.5, 8.0, 7....
## $ ptsv <int> 16, 37, 7, 23, 34, 40, 16, 34, 28, 21, 17, 26, 24, 23, 10...
## $ ptsh <int> 13, 31, 10, 27, 6, 39, 23, 37, 25, 49, 31, 20, 27, 31, 19...
Here's our helper function to view both the top and bottom results at the same time.
top_and_bottom <- function(.df, top = 5, bottom = 5) {
  .df %>%
    filter(row_number() <= top | row_number() > (n() - bottom))
}
# make sure the dataframe passed to top_and_bottom is sorted
Now we will calculate HFA by team. The five teams with the biggest and five teams with the smallest HFA are shown in the table output below the code.
hfa_by_team <- nfl %>%
  # one row per team per game: location is "h" or "v"
  gather(location, team, c(h, v)) %>%
  # home margin of victory for every game
  mutate(mov_h = ptsh - ptsv) %>%
  # average that margin for each team, split by home/visitor
  group_by(team, location) %>%
  summarise(mov_avg = mean(mov_h)) %>%
  ungroup() %>%
  # flip the sign for visitors so mov_avg is always from the team's view
  mutate(mov_avg = ifelse(location == "v",
                          mov_avg * -1, mov_avg)) %>%
  # one row per team with home (h) and road (v) point differentials
  spread(location, mov_avg) %>%
  # HFA = (home point differential - road point differential) / 2
  mutate(mov_spread = round(h, 2) - round(v, 2),
         hfa = round(mov_spread / 2, 1)) %>%
  arrange(desc(hfa)) %>%
  select(team, hfa)
top_and_bottom(hfa_by_team) %>%
  knitr::kable()
team | hfa |
---|---|
LAC | 4.7 |
SEA | 4.5 |
ARI | 4.3 |
BAL | 4.1 |
SF | 3.8 |
TB | 1.4 |
WAS | 1.3 |
CAR | 1.2 |
NYG | 1.2 |
LA | -1.3 |
Often it's easier to get a feel for numbers by plotting them rather than looking at a table. Note: the plot below might seem strange to some, since they're used to seeing such data displayed in bar charts. While a bar chart works for this data, visually it's overkill in my opinion. Simple dots along a number line, aligned with each team, get the point across (pun intended) while spilling less ink.
hfa_avg <- mean(hfa_by_team$hfa)
hfa_by_team %>%
  ggplot(., aes(reorder(team, hfa), hfa)) +
  geom_point(col = "slateblue3", size = 2, shape = 1) +
  coord_flip() +
  labs(x = "Team",
       y = "Home Field Advantage",
       title = "Home Field Advantage by Team: 2002 - 2017") +
  scale_y_continuous(breaks = seq(-1.5, 5, .5)) +
  geom_hline(yintercept = hfa_avg,
             linetype = 2, col = "slategray3") +
  annotate("text", y = 2.83, x = 4,
           label = "League Average", size = 2.5,
           color = "gray11") +
  theme_minimal(base_size = 8)
Like just about everything in sports betting and stats, to my knowledge there's no exact consensus on how to calculate HFA in the NFL. There are a few articles that I've checked my work against. The first is an article by Chase Stuart at footballperspective.com. He uses the basic formula:
(Home point differential – Road point differential) / 2
For the most part my numbers match up with his. There's also a quality article at pinnacle.com by Mark Taylor; his take is slightly different. Bill Barnwell's Grantland article uses the (Home PD - Road PD) / 2 method.
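As a sanity check on that formula, we can compute it by hand for a single team, say Seattle (sea_home and sea_road are just scratch names for this check):

# Seattle's average home point differential
sea_home <- nfl %>%
  filter(h == "SEA") %>%
  summarise(pd = mean(ptsh - ptsv)) %>%
  pull(pd)

# Seattle's average road point differential
sea_road <- nfl %>%
  filter(v == "SEA") %>%
  summarise(pd = mean(ptsv - ptsh)) %>%
  pull(pd)

(sea_home - sea_road) / 2   # should land near SEA's hfa of ~4.5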
I don't like using decimals because it implies precise knowledge. No one knows a team's true HFA, especially not to the second or third decimal. That's true for almost all stats. If you look across the articles above and around the net on HFA, you'll notice that the numbers don't match up. No one uses the same database; some people include playoffs, some don't; some don't include old stadiums; etc. There are a lot of choices that need to be made, and each one impacts these numbers. These are just estimates. Treat them as such. For reference, here's the full table, including home and road point differentials for every team.
nfl %>%
  gather(location, team, c(h, v)) %>%
  mutate(mov_h = ptsh - ptsv) %>%
  group_by(team, location) %>%
  summarise(mov_avg = mean(mov_h)) %>%
  ungroup() %>%
  mutate(mov_avg = ifelse(location == "v",
                          mov_avg * -1, mov_avg)) %>%
  spread(location, mov_avg) %>%
  mutate(diff = round(h, 2) - round(v, 2),
         hfa = round(diff / 2, 1)) %>%
  arrange(desc(hfa)) %>%
  rename(hm_pt_diff = h, aw_pt_diff = v) %>%
  knitr::kable(digits = 3)
team | hm pt diff | aw pt diff | diff | hfa |
---|---|---|---|---|
LAC | 9.875 | 0.500 | 9.38 | 4.7 |
SEA | 7.288 | -1.647 | 8.94 | 4.5 |
ARI | 1.931 | -6.606 | 8.54 | 4.3 |
BAL | 7.394 | -0.755 | 8.15 | 4.1 |
SF | 1.366 | -6.153 | 7.52 | 3.8 |
STL | -1.205 | -8.860 | 7.65 | 3.8 |
MIN | 4.153 | -3.143 | 7.29 | 3.6 |
BUF | 1.549 | -5.391 | 6.94 | 3.5 |
GB | 7.442 | 0.518 | 6.92 | 3.5 |
DET | -0.547 | -7.403 | 6.85 | 3.4 |
NYJ | 2.186 | -4.401 | 6.59 | 3.3 |
HOU | 0.697 | -5.519 | 6.22 | 3.1 |
IND | 5.496 | -0.277 | 5.78 | 2.9 |
CHI | 1.639 | -4.024 | 5.66 | 2.8 |
DEN | 4.416 | -1.238 | 5.66 | 2.8 |
JAC | 0.177 | -5.098 | 5.28 | 2.6 |
KC | 2.924 | -2.280 | 5.20 | 2.6 |
SD | 5.392 | 0.424 | 4.97 | 2.5 |
ATL | 3.331 | -1.545 | 4.87 | 2.4 |
PIT | 6.899 | 2.125 | 4.78 | 2.4 |
NO | 4.549 | -0.030 | 4.58 | 2.3 |
TEN | 0.131 | -4.246 | 4.38 | 2.2 |
NE | 11.020 | 7.212 | 3.81 | 1.9 |
PHI | 5.288 | 1.421 | 3.87 | 1.9 |
CLE | -3.859 | -7.380 | 3.52 | 1.8 |
DAL | 2.576 | -1.084 | 3.66 | 1.8 |
MIA | -0.134 | -3.646 | 3.52 | 1.8 |
OAK | -3.302 | -6.891 | 3.59 | 1.8 |
CIN | 1.160 | -1.779 | 2.94 | 1.5 |
TB | -0.457 | -3.318 | 2.86 | 1.4 |
WAS | -1.362 | -3.953 | 2.59 | 1.3 |
CAR | 1.881 | -0.437 | 2.32 | 1.2 |
NYG | 1.031 | -1.333 | 2.36 | 1.2 |
LA | -2.294 | 0.312 | -2.60 | -1.3 |