Published: May 16, 2018
text in italic TT hheisi sp aacfrkagitecls en eeded.
library(tidyverse)
library(rvest)
library(stringr)
Sports data is all over the internet. Some of it is in a nice downloadable format. For example, it's possible download almost all of the information in CSV format from pro-football-reference (or one of their other sites like <baseball-reference.com>). Other times the data is in a nice format but there's no easy way to export or download it. It's this situation that we'll deal with.
A note about scrapping: Scrapping is a gray area. How legal is it? On the bad things you can do on the internet spectrum it's probably worse than creating fake logins but better than piracy. However, this isn't the first tutorial that details how to scrape data from a site. With that said, don't pound servers that don't belong to you. If you're going to scrape a lot of data at least put a reasonable delay between your requests.
For the example we're going to use NFL data gathered by <teamrankings.com>. Stats nicely formatted in table form makes importing data into R easy.
The plan is to scrape NFL stats, more precisely the team stats. Luckily, <teamrankings.com> has a minimal design. We can go to the NFL tab in the nav bar and copy the link for "All Stats".
We're going to assign the url to tr_url
and use the read_html
function from the rvest
package to read the web page into R.
tr_url <- "https://www.teamrankings.com/nfl/team-stats"
tr <- read_html(tr_url)
In my experience most tutorials on web scrapping make it out to be easier than it really is. Most focus on one page but don't explain: how to scale up, how to deal with data on more than one page, selecting the proper pages from many, piecing together the data from the various sources (or pages), etc. I don't know if this explainer will be much better but I'll try.
The first thing we need to do is look at how the data is laid out on the site. If you go to the link you'll see that the stats are split between many pages. The individual pages are grouped by type, for example, "Total Offense", "Rushing Offense", "Scoring Defense." Expand one of the groups and the links to the individual pages are displayed.
Follow the link under "Scoring Offense" to Points Per Game, the data is in a HTML table.
The table contains the columns:
Pick another stat from the list . I chose Two Point Conversion Percentage. The columns are the same. Do this a couple more times and it becomes clear that all the team stat pages for the NFL follow this format. So, if we can download the individual tables we shouldn't have much trouble binding them together.
Now we're ready to make a plan to scrape the data.
We're going to take the team stats page we read in earlier, tr
, pass
that to the html_nodes
. The output is piped into html_attr
with
argument "href"
so we only get the hyper links with a URL.
tr_links <- tr %>%
html_nodes("a") %>%
html_attr("href")
head(tr_links,10)
## [1] "/" "/login/"
## [3] "/register/" "/"
## [5] "#" "#"
## [7] "#" "#"
## [9] "#" "/football-pool-picks/"
You can see there's a lot of stuff in there we don't need like the login link or the links for other sports. One solution is to filter the list of links for links containing "nfl". But looking at the first results shows that not all "nfl" links are to the pages containing stats.
head(tr_links[str_detect(tr_links,"nfl")])
## [1] "/nfl-survivor-pool-picks/" "/nfl-win-picks/"
## [3] "/nfl-ats-picks/" "/nfl-over-under-picks/"
## [5] "/nfl-money-line-picks/" "/nfl-betting-picks/"
Looking at urls of the individual stat pages on
https://www.teamrankings.com/nfl/team-stats we notice the pattern
"nfl/stat" in all team stat urls. Check to see how many urls are in the
tr_links
vector and how many of those are NFL stat pages.
print(length(tr_links))
## [1] 694
print(length(tr_links[str_detect(tr_links,"nfl/stat")]))
## [1] 216
About a third of the pages are stat related.
Let's create a new vector nfl_links
to keep things clear.
nfl_links <- tr_links[str_detect(tr_links,"nfl/stat")]
head(nfl_links)
## [1] "/nfl/stats/" "/nfl/stats/"
## [3] "/nfl/stat/points-per-game" "/nfl/stat/average-scoring-margin"
## [5] "/nfl/stat/yards-per-point" "/nfl/stat/yards-per-point-margin"
Looking at the results we notice another issue. This time the overview
page (the one we're on currently) is listed in the results. There's no
stats on this page. We need to remove those results from the vector, but
we're not sure exactly how many more times "nfl/stats/" is in the
nfl_links
vector. We'll use the same method we used above with
str_detect
but instead we will select the results that come back
FALSE
nfl_links <- nfl_links[!str_detect(nfl_links, "/nfl/stats/")]
# We could do it this way as well
# nfl_links %in% "/nfl/stats/"
We want to keep the entire process tidy, let's start by putting the stat links in a dataframe (or in this case a tibble). It might not be clear now but this will make things a lot easier a few steps from now.
df <- tibble(stat_links = nfl_links)
We could pull the tables from every page but that would be 212 different pages to scrape. IMO overkill for the example. Plus, there wouldn't be much structure to the final table. Instead, we will focus on pulling all the per-play stats. Our final table will be the per-play stats for each team for every per-play category on the teamrankings site.
The way we accomplish this is using the same str_detect
method as
above. Create a column using mutate
called is_per_play
the column
will contain the logical results from str_detect
. If the stat_links
column contains "per-play" is_per_play
is TRUE
. Next filter only the
results where is_per_play
is TRUE
.
The dataframe is down to 11 rows. These 11 urls are the pages we're going to scrape.
df <- df %>%
mutate(is_per_play = str_detect(stat_links, "per-play")) %>%
filter(is_per_play == TRUE)
glimpse(df)
## Observations: 11
## Variables: 2
## $ stat_links <chr> "/nfl/stat/points-per-play", "/nfl/stat/points-per...
## $ is_per_play <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
You've probably noticed that the urls aren't absolute but relative urls.
We're going to need to add the <teamrankings.com> to the front of
the urls in the stat_links
columns.
Now that we have the correct path we paste it to the front of the
stat_links
column. Notice that we're still keeping all the results in
the same dataframe.
df <- df %>%
mutate(url = paste0('https://www.teamrankings.com', stat_links))
df %>%
head() %>%
knitr::kable()
stat_links | is_per_play | url |
---|---|---|
/nfl/stat/points-per-play | TRUE | https://www.teamrankings.com/nfl/stat/points-per-play |
/nfl/stat/points-per-play-margin | TRUE | https://www.teamrankings.com/nfl/stat/points-per-play-margin |
/nfl/stat/yards-per-play | TRUE | https://www.teamrankings.com/nfl/stat/yards-per-play |
/nfl/stat/first-downs-per-play | TRUE | https://www.teamrankings.com/nfl/stat/first-downs-per-play |
/nfl/stat/punts-per-play | TRUE | https://www.teamrankings.com/nfl/stat/punts-per-play |
/nfl/stat/opponent-points-per-play | TRUE | https://www.teamrankings.com/nfl/stat/opponent-points-per-play |
Now, the next piece of code isn't best of way doing things. I tried to
come up with ways using list columns in dataframes. I've come up with
solutions using map
and walk
from the purrr
package but they were
"too cute" to be practical. IMO the code below is readable and works for
the task at hand so we'll leave it as is. I got the idea for using
sample
with Sys.sleep()
from this Bob Rudis blog
post.
get_page <- function(url){
page <- read_html(url)
Sys.sleep(sample(seq(.25,2.5,.25),1))
page
}
page_data <- map(df$url, get_page)
Now we have a list with 11 items where every item is a web page
downloaded via read_html
. We can easily extract the html tables from
the pages using html_table
. Instead of a loop this time we'll use
map
.
tr_data <- map(page_data, html_table)
The structure of tr_data
is a little nasty. It's an 11 item list where
every item is a list of length 1.
print(length(tr_data))
## [1] 11
print(map_dbl(tr_data, length))
## [1] 1 1 1 1 1 1 1 1 1 1 1
print(map_chr(tr_data, class))
## [1] "list" "list" "list" "list" "list" "list" "list" "list" "list" "list"
## [11] "list"
Let's convert the list into one tibble. The code below looks ugly but it's relatively easy to follow.
pluck
to "pull out" the main list from tr_data
map2_df
get_page
. Now we're
going to take those eleven urls and pass them as the second item to
iterate over in map2_df
.tr_data
list is converted to a tibble
and
then a new column is added with the URL page that the new tibble was
created from. (since tr_data
was created from the links in
df$stat_links
they line up perfectly).map2_df
instead of map2
the eleven tibbles are
binded into one.tibble
to set_names
to make the column names
more R friendly.tr_data <- pluck(tr_data, 1) %>%
map2_df(df$stat_links,
~as_tibble(.x) %>%
mutate(stat = .y)) %>%
set_names(c(
'rank',
'team',
'current_seas',
'last_3',
'last_1',
'home',
'away',
'last_seas',
'stat'
))
Let's look at the result. We have 9 columns. current_seas
and
last_seas
are the stats for 2017 and 2016 respectively. stat
is the
type of football statistic referenced. Because the season is only four
weeks in (at the time of writing this) let's drop the last_3
and
last_1
columns.
tr_data <- tr_data %>%
select(-last_3, -last_1)
Even after the cleanup this isn't the most efficient format for
analyzing the data. We want the stats to be the columns and the teams to
be repeated four times, for the years and the locations. The text format
of the stat
column isn't very R friendly for column names.
glimpse(tr_data)
## Observations: 352
## Variables: 7
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ team <chr> "LA Rams", "New Orleans", "Philadelphia", "New En...
## $ current_seas <dbl> 0.461, 0.448, 0.434, 0.429, 0.420, 0.418, 0.397, ...
## $ home <dbl> 0.396, 0.486, 0.436, 0.416, 0.388, 0.457, 0.421, ...
## $ away <dbl> 0.518, 0.410, 0.431, 0.444, 0.457, 0.383, 0.366, ...
## $ last_seas <dbl> 0.233, 0.424, 0.340, 0.423, 0.390, 0.341, 0.383, ...
## $ stat <chr> "/nfl/stat/points-per-play", "/nfl/stat/points-pe...
stat
names.There's more than one way to accomplish most things in R. This might not
be the best way. I find it to be cleaner than using a function like
recode
. We're going to create a new dataframe with the unique values
of the stat
column and the new names as the other. We're going to join
that dataframe to the tr_data
and remove the original stat
column.
First let's look at the stat names.
raw_stat_names |
---|
/nfl/stat/points-per-play |
/nfl/stat/points-per-play-margin |
/nfl/stat/yards-per-play |
/nfl/stat/first-downs-per-play |
/nfl/stat/punts-per-play |
/nfl/stat/opponent-points-per-play |
/nfl/stat/opponent-yards-per-play |
/nfl/stat/opponent-first-downs-per-play |
/nfl/stat/opponent-punts-per-play |
/nfl/stat/penalties-per-play |
/nfl/stat/opponent-penalties-per-play |
These clearly aren't R friendly column names. Some cleanup is needed.
new_col_names <- raw_stat_names %>%
str_replace_all('/nfl/stat/','') %>%
str_replace_all('-','_') %>%
str_replace_all('points','pts') %>%
str_replace_all('yards','yds') %>%
str_replace_all('opponent','opp') %>%
str_replace_all('per_play','pp')
df_col_names <-
tibble(stat = unique(tr_data$stat),
new_cols = new_col_names)
df_col_names %>%
knitr::kable()
stat | new_cols |
---|---|
/nfl/stat/points-per-play | pts_pp |
/nfl/stat/points-per-play-margin | pts_pp_margin |
/nfl/stat/yards-per-play | yds_pp |
/nfl/stat/first-downs-per-play | first_downs_pp |
/nfl/stat/punts-per-play | punts_pp |
/nfl/stat/opponent-points-per-play | opp_pts_pp |
/nfl/stat/opponent-yards-per-play | opp_yds_pp |
/nfl/stat/opponent-first-downs-per-play | opp_first_downs_pp |
/nfl/stat/opponent-punts-per-play | opp_punts_pp |
/nfl/stat/penalties-per-play | penalties_pp |
/nfl/stat/opponent-penalties-per-play | opp_penalties_pp |
note: for the example below we're going to remove the Home
and Away
columns but you don't need to.
Now we remove the stat
column and then rename the new column we added
stat
. Next we gather
the dataframe by the season creating two new
columns year
and val
. This leaves us with a dataframe with four
columns; Team
, stat
, year
, val
. We want the values in the stat
column to be the new column names. Call spread
on stat
and use the
values in val
. Our new dataframe is 64 rows long (32 teams x 2
seasons) and 13 columns wide (the 11 stats + year + Team). Fix the
year
column by replacing "X2015"/"X2016" with 2015/2016.
tr_data <- tr_data %>%
left_join(df_col_names, by = "stat") %>%
select(-stat, -home, -away,-rank) %>%
rename(stat = new_cols) %>%
gather(year, val, current_seas:last_seas) %>%
spread(stat, val) %>%
mutate(year = recode(year,
"last_seas" = 2016,
"current_seas" = 2017)) %>%
arrange(year, team)
glimpse(tr_data)
## Observations: 64
## Variables: 13
## $ team <chr> "Arizona", "Atlanta", "Baltimore", "Buffalo...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ first_downs_pp <dbl> 0.339, 0.385, 0.293, 0.324, 0.302, 0.335, 0...
## $ opp_first_downs_pp <dbl> 0.288, 0.343, 0.296, 0.332, 0.318, 0.318, 0...
## $ opp_penalties_pp <dbl> 0.06, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0...
## $ opp_pts_pp <dbl> 0.355, 0.379, 0.323, 0.371, 0.388, 0.394, 0...
## $ opp_punts_pp <dbl> 0.08, 0.06, 0.09, 0.07, 0.07, 0.07, 0.07, 0...
## $ opp_yds_pp <dbl> 4.8, 5.6, 5.2, 5.6, 5.6, 5.5, 5.4, 5.9, 5.5...
## $ penalties_pp <dbl> 0.05, 0.05, 0.06, 0.06, 0.05, 0.06, 0.04, 0...
## $ pts_pp <dbl> 0.385, 0.550, 0.318, 0.394, 0.351, 0.289, 0...
## $ pts_pp_margin <dbl> 0.030, 0.171, -0.005, 0.023, -0.038, -0.106...
## $ punts_pp <dbl> 0.07, 0.05, 0.07, 0.07, 0.07, 0.07, 0.07, 0...
## $ yds_pp <dbl> 5.4, 6.7, 5.2, 5.6, 5.2, 5.9, 5.4, 5.1, 6.0...
Now our data is in a tidy format and ready for analysis.
Here's the dataset teamrankings-dataset.csv