Web Scraping NFL Data with R and the tidyverse

Published: May 16, 2018

Scraping data with R

These are the packages this article needs.

library(tidyverse)
library(rvest)
library(stringr)

Read the page from the url

Sports data is all over the internet. Some of it is in a nice downloadable format. For example, it's possible to download almost all of the information on pro-football-reference (or one of their other sites like <baseball-reference.com>) in CSV format. Other times the data is in a nice format but there's no easy way to export or download it. It's this second situation that we'll deal with here.

A note about scraping: scraping is a gray area. How legal is it? On the spectrum of bad things you can do on the internet, it's probably worse than creating fake logins but better than piracy. However, this isn't the first tutorial that details how to scrape data from a site. With that said, don't pound servers that don't belong to you. If you're going to scrape a lot of data, at least put a reasonable delay between your requests.
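A delay can be as simple as sleeping for a random interval between requests. The sketch below is purely illustrative and the urls in it are made up; the real download code comes later in the post.

# a minimal sketch of polite scraping; these urls are hypothetical
urls <- c("https://example.com/page-1", "https://example.com/page-2")

pages <- lapply(urls, function(url) {
  Sys.sleep(runif(1, 1, 3))  # wait 1-3 seconds between requests
  read_html(url)
})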

For the example we're going to use NFL data gathered by <teamrankings.com>. The stats are nicely formatted in tables, which makes importing the data into R easy.

The plan is to scrape NFL stats, more precisely the team stats. Luckily, <teamrankings.com> has a minimal design. We can go to the NFL tab in the nav bar and copy the link for "All Stats".

We're going to assign the url to tr_url and use the read_html function from the rvest package to read the web page into R.

tr_url <- "https://www.teamrankings.com/nfl/team-stats"
tr <- read_html(tr_url)

Web scraping is hard

In my experience, most tutorials on web scraping make it out to be easier than it really is. Most focus on a single page but don't explain how to scale up: how to deal with data spread across more than one page, how to select the proper pages from many, how to piece together the data from the various sources (or pages), and so on. I don't know if this explainer will be much better, but I'll try.

Exploring the layout

The first thing we need to do is look at how the data is laid out on the site. If you go to the link you'll see that the stats are split between many pages. The individual pages are grouped by type, for example, "Total Offense", "Rushing Offense", "Scoring Defense." Expand one of the groups and the links to the individual pages are displayed.

Follow the link under "Scoring Offense" to Points Per Game; the data is in an HTML table.

The table contains the columns:

  • Rank
  • Team
  • 2017
  • Last 3
  • Last 1
  • Home
  • Away
  • 2016

Pick another stat from the list. I chose Two Point Conversion Percentage. The columns are the same. Do this a couple more times and it becomes clear that all the NFL team stat pages follow this format. So, if we can download the individual tables, we shouldn't have much trouble binding them together. To get a feel for it, here's a quick look at a single page first.
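This is just a hedged sketch: the url is the one copied from the Points Per Game page, and html_table returns a list of every table on the page, so we grab the first (and only) one.

ppg <- read_html("https://www.teamrankings.com/nfl/stat/points-per-game") %>% 
  html_table() %>% 
  .[[1]]  # keep the single stats table on the page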

Making a plan

Now we're ready to make a plan to scrape the data.

  1. Get all the links for the NFL team stats pages
  2. Decide which links are relevant for our data
  3. Go to the individual pages from step 2 and download the page into R
  4. Extract the html tables from the pages in step 3
  5. Bind the tables together
  6. Clean the data and put the table into a tidy format

Getting the links

We're going to take the team stats page we read in earlier, tr, and pass it to html_nodes to grab all the anchor tags. The output is piped into html_attr with the argument "href" so we get only the hyperlinks' URLs.

tr_links <- tr %>% 
  html_nodes("a") %>% 
  html_attr("href")

head(tr_links,10)

##  [1] "/"                     "/login/"              
##  [3] "/register/"            "/"                    
##  [5] "#"                     "#"                    
##  [7] "#"                     "#"                    
##  [9] "#"                     "/football-pool-picks/"

You can see there's a lot of stuff in there we don't need, like the login link or links for other sports. One solution is to filter the list for links containing "nfl". But looking at the first results shows that not all "nfl" links point to pages containing stats.

head(tr_links[str_detect(tr_links,"nfl")])

## [1] "/nfl-survivor-pool-picks/" "/nfl-win-picks/"          
## [3] "/nfl-ats-picks/"           "/nfl-over-under-picks/"   
## [5] "/nfl-money-line-picks/"    "/nfl-betting-picks/"

Looking at the urls of the individual stat pages on https://www.teamrankings.com/nfl/team-stats, we notice the pattern "nfl/stat" in all the team stat urls. Let's check how many urls are in the tr_links vector and how many of those are NFL stat pages.

print(length(tr_links))

## [1] 694

print(length(tr_links[str_detect(tr_links,"nfl/stat")]))

## [1] 216

About a third of the links are stat related.

Let's create a new vector nfl_links to keep things clear.

nfl_links <- tr_links[str_detect(tr_links,"nfl/stat")]
head(nfl_links)

## [1] "/nfl/stats/"                      "/nfl/stats/"                     
## [3] "/nfl/stat/points-per-game"        "/nfl/stat/average-scoring-margin"
## [5] "/nfl/stat/yards-per-point"        "/nfl/stat/yards-per-point-margin"

Looking at the results, we notice another issue: the overview page (the one we're on currently) is listed in the results. There are no stats on that page. We need to remove those entries from the vector, but we're not sure exactly how many times "/nfl/stats/" appears in nfl_links. We'll use the same str_detect method as above, but this time we'll keep the results that come back FALSE.

nfl_links <- nfl_links[!str_detect(nfl_links, "/nfl/stats/")]

# We could accomplish the same thing with %in%:
# nfl_links <- nfl_links[!(nfl_links %in% "/nfl/stats/")]

Put links in dataframe

We want to keep the entire process tidy, so let's start by putting the stat links in a dataframe (or, in this case, a tibble). It might not be clear now, but this will make things a lot easier a few steps from now.

df <- tibble(stat_links = nfl_links)

Find per play stats

We could pull the tables from every page, but that would be 212 different pages to scrape, which is overkill for this example IMO. Plus, there wouldn't be much structure to the final table. Instead, we'll focus on pulling all the per-play stats. Our final table will contain the per-play stats for each team for every per-play category on the teamrankings site.

The way we accomplish this is with the same str_detect method as above. Using mutate, create a column called is_per_play that holds the logical result of str_detect: if the stat_links column contains "per-play", is_per_play is TRUE. Then filter for only the rows where is_per_play is TRUE.

After filtering, the dataframe is down to 11 rows. These 11 urls are the pages we're going to scrape.

df <- df %>% 
  mutate(is_per_play = str_detect(stat_links, "per-play")) %>% 
  filter(is_per_play == TRUE)

glimpse(df)

## Observations: 11
## Variables: 2
## $ stat_links  <chr> "/nfl/stat/points-per-play", "/nfl/stat/points-per...
## $ is_per_play <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...

Fixing the links

You've probably noticed that the urls are relative, not absolute. We're going to need to add https://www.teamrankings.com to the front of the urls in the stat_links column.

Now that we have the correct base url, we paste it onto the front of the stat_links column. Notice that we're still keeping all the results in the same dataframe.

df <- df %>% 
  mutate(url = paste0('https://www.teamrankings.com', stat_links))

df %>% 
  head() %>% 
  knitr::kable()

| stat_links                         | is_per_play | url                                                            |
|------------------------------------|-------------|----------------------------------------------------------------|
| /nfl/stat/points-per-play          | TRUE        | https://www.teamrankings.com/nfl/stat/points-per-play          |
| /nfl/stat/points-per-play-margin   | TRUE        | https://www.teamrankings.com/nfl/stat/points-per-play-margin   |
| /nfl/stat/yards-per-play           | TRUE        | https://www.teamrankings.com/nfl/stat/yards-per-play           |
| /nfl/stat/first-downs-per-play     | TRUE        | https://www.teamrankings.com/nfl/stat/first-downs-per-play     |
| /nfl/stat/punts-per-play           | TRUE        | https://www.teamrankings.com/nfl/stat/punts-per-play           |
| /nfl/stat/opponent-points-per-play | TRUE        | https://www.teamrankings.com/nfl/stat/opponent-points-per-play |

Downloading the web page

Now, the next piece of code isn't the best way of doing things. I tried to come up with approaches using list columns in dataframes, and I have solutions using map and walk from the purrr package, but they were "too cute" to be practical. IMO the code below is readable and works for the task at hand, so we'll leave it as is. I got the idea for using sample with Sys.sleep() from this Bob Rudis blog post.

get_page <- function(url){
  # read the page, then pause a random 0.25-2.5 seconds
  # so we don't pound teamrankings' servers
  page <- read_html(url)
  Sys.sleep(sample(seq(.25, 2.5, .25), 1))
  page
}

page_data <- map(df$url, get_page)

Now we have a list with 11 items, where every item is a web page downloaded via read_html. We can easily extract the HTML tables from the pages using html_table. Again we'll use map rather than a loop.

tr_data <- map(page_data, html_table)

The structure of tr_data is a little nasty. It's an 11-item list where every item is itself a list of length 1.

print(length(tr_data))

## [1] 11

print(map_dbl(tr_data, length))

##  [1] 1 1 1 1 1 1 1 1 1 1 1

print(map_chr(tr_data, class))

##  [1] "list" "list" "list" "list" "list" "list" "list" "list" "list" "list"
## [11] "list"

Binding all the tables together

Let's convert the list into one tibble. The code below looks ugly but it's relatively easy to follow.

  1. Use map with pluck to pull each table out of its length-one enclosing list, giving a plain list of eleven data frames
  2. Pass that list to map2_df
  3. From the code above we see that the list is eleven items long. That makes sense since we passed eleven urls to get_page. Now we take the eleven stat links in df$stat_links and pass them as the second vector to iterate over in map2_df
  4. Each table is converted to a tibble, and a new column is added recording the stat link the tibble was created from (since the pages were downloaded in the order of the links in df$stat_links, they line up perfectly)
  5. Since we called map2_df instead of map2, the eleven tibbles are row-bound into one
  6. We pass this new tibble to set_names to make the column names more R friendly

tr_data <- map(tr_data, pluck, 1) %>% 
  map2_df(df$stat_links, 
          ~as_tibble(.x) %>% 
            mutate(stat = .y)) %>% 
  set_names(c(
    'rank',
    'team',
    'current_seas',
    'last_3',
    'last_1',
    'home',
    'away',
    'last_seas',
    'stat'
  ))

Let's look at the result. We have 9 columns. current_seas and last_seas are the stats for 2017 and 2016, respectively. stat is the type of football statistic referenced. Because the season is only four weeks in (at the time of writing this), let's drop the last_3 and last_1 columns.

tr_data <- tr_data %>% 
  select(-last_3, -last_1)

Even after the cleanup, this isn't the most convenient format for analyzing the data. We want the stats to be the columns, with each team repeated four times: once for each combination of year and location. Also, the text format of the stat column isn't very R friendly for column names.

glimpse(tr_data)

## Observations: 352
## Variables: 7
## $ rank         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ team         <chr> "LA Rams", "New Orleans", "Philadelphia", "New En...
## $ current_seas <dbl> 0.461, 0.448, 0.434, 0.429, 0.420, 0.418, 0.397, ...
## $ home         <dbl> 0.396, 0.486, 0.436, 0.416, 0.388, 0.457, 0.421, ...
## $ away         <dbl> 0.518, 0.410, 0.431, 0.444, 0.457, 0.383, 0.366, ...
## $ last_seas    <dbl> 0.233, 0.424, 0.340, 0.423, 0.390, 0.341, 0.383, ...
## $ stat         <chr> "/nfl/stat/points-per-play", "/nfl/stat/points-pe...

Changing the stat names

There's more than one way to accomplish most things in R, and this might not be the best way, but I find it cleaner than using a function like recode. We're going to create a new dataframe with the unique values of the stat column in one column and the new names in the other. Then we'll join that dataframe to tr_data and remove the original stat column.

First, let's pull out the unique stat names and take a look.

raw_stat_names <- unique(tr_data$stat)

| raw_stat_names                          |
|-----------------------------------------|
| /nfl/stat/points-per-play               |
| /nfl/stat/points-per-play-margin        |
| /nfl/stat/yards-per-play                |
| /nfl/stat/first-downs-per-play          |
| /nfl/stat/punts-per-play                |
| /nfl/stat/opponent-points-per-play      |
| /nfl/stat/opponent-yards-per-play       |
| /nfl/stat/opponent-first-downs-per-play |
| /nfl/stat/opponent-punts-per-play       |
| /nfl/stat/penalties-per-play            |
| /nfl/stat/opponent-penalties-per-play   |

These clearly aren't R friendly column names. Some cleanup is needed.

new_col_names <- raw_stat_names %>% 
  str_replace_all('/nfl/stat/','') %>% 
  str_replace_all('-','_') %>% 
  str_replace_all('points','pts') %>% 
  str_replace_all('yards','yds') %>% 
  str_replace_all('opponent','opp') %>% 
  str_replace_all('per_play','pp')
  
df_col_names <- tibble(stat = unique(tr_data$stat),
                       new_cols = new_col_names)

df_col_names %>% 
  knitr::kable()

| stat                                    | new_cols           |
|-----------------------------------------|--------------------|
| /nfl/stat/points-per-play               | pts_pp             |
| /nfl/stat/points-per-play-margin        | pts_pp_margin      |
| /nfl/stat/yards-per-play                | yds_pp             |
| /nfl/stat/first-downs-per-play          | first_downs_pp     |
| /nfl/stat/punts-per-play                | punts_pp           |
| /nfl/stat/opponent-points-per-play      | opp_pts_pp         |
| /nfl/stat/opponent-yards-per-play       | opp_yds_pp         |
| /nfl/stat/opponent-first-downs-per-play | opp_first_downs_pp |
| /nfl/stat/opponent-punts-per-play       | opp_punts_pp       |
| /nfl/stat/penalties-per-play            | penalties_pp       |
| /nfl/stat/opponent-penalties-per-play   | opp_penalties_pp   |

Note: for the example below we're going to remove the home and away columns, but you don't need to.

Now we join the new names onto tr_data, drop the original stat column (along with rank, home, and away), and rename new_cols to stat. Next we gather the dataframe by season, creating two new columns, year and val. This leaves us with a dataframe of four columns: team, stat, year, and val. We want the values in the stat column to become the new column names, so we spread on stat using the values in val. Our new dataframe is 64 rows long (32 teams x 2 seasons) and 13 columns wide (the 11 stats + year + team). Finally, fix the year column by recoding "current_seas" and "last_seas" to 2017 and 2016.

tr_data <- tr_data %>% 
  left_join(df_col_names, by = "stat") %>% 
  select(-stat, -home, -away, -rank) %>% 
  rename(stat = new_cols) %>% 
  gather(year, val, current_seas:last_seas) %>% 
  spread(stat, val) %>% 
  mutate(year = recode(year, 
                       "last_seas" = 2016, 
                       "current_seas" = 2017)) %>% 
  arrange(year, team)

glimpse(tr_data)

## Observations: 64
## Variables: 13
## $ team               <chr> "Arizona", "Atlanta", "Baltimore", "Buffalo...
## $ year               <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ first_downs_pp     <dbl> 0.339, 0.385, 0.293, 0.324, 0.302, 0.335, 0...
## $ opp_first_downs_pp <dbl> 0.288, 0.343, 0.296, 0.332, 0.318, 0.318, 0...
## $ opp_penalties_pp   <dbl> 0.06, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0...
## $ opp_pts_pp         <dbl> 0.355, 0.379, 0.323, 0.371, 0.388, 0.394, 0...
## $ opp_punts_pp       <dbl> 0.08, 0.06, 0.09, 0.07, 0.07, 0.07, 0.07, 0...
## $ opp_yds_pp         <dbl> 4.8, 5.6, 5.2, 5.6, 5.6, 5.5, 5.4, 5.9, 5.5...
## $ penalties_pp       <dbl> 0.05, 0.05, 0.06, 0.06, 0.05, 0.06, 0.04, 0...
## $ pts_pp             <dbl> 0.385, 0.550, 0.318, 0.394, 0.351, 0.289, 0...
## $ pts_pp_margin      <dbl> 0.030, 0.171, -0.005, 0.023, -0.038, -0.106...
## $ punts_pp           <dbl> 0.07, 0.05, 0.07, 0.07, 0.07, 0.07, 0.07, 0...
## $ yds_pp             <dbl> 5.4, 6.7, 5.2, 5.6, 5.2, 5.9, 5.4, 5.1, 6.0...

Now our data is in a tidy format and ready for analysis.
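As a quick example of what the tidy format buys us, here's a snippet (purely illustrative, not part of the original pipeline) that ranks the 2017 teams by points per play:

tr_data %>% 
  filter(year == 2017) %>%     # keep the current season
  arrange(desc(pts_pp)) %>%    # best scoring offenses first
  select(team, pts_pp, yds_pp)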

Here's the dataset: teamrankings-dataset.csv
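If you'd rather skip the scraping, you can load the CSV directly, assuming it's saved in your working directory:

tr_data <- readr::read_csv("teamrankings-dataset.csv")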