Web Scraping

Web Scraping from the IMDb Website

This demo is about basic web scraping.

We show how to parse (static) HTML content, extract relevant information, and store it in an R-friendly manner.

# Load required packages

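library(tidyverse)  # attaches dplyr, stringr and others, plus the %>% pipe
library(rvest)      # selecting nodes and extracting text from HTML
library(stringr)    # string cleaning (already attached by tidyverse)
library(xml2)       # parsing HTML via read_html()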

Get contents

We specify the URL of the page to be scraped (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and parse its contents. Despite its name, the object url stores the parsed HTML document rather than the address string:

(url <- xml2::read_html("http://www.imdb.com/chart/top?ref_=nv_mv_250_6"))

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...
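To experiment with the parsing step without requesting the live site, note that read_html() also accepts a literal HTML string. The following is a minimal sketch on a made-up miniature page; toy_html and toy_page are hypothetical names, and the markup is not the real IMDb source:

```r
library(xml2)

# Hypothetical miniature page for illustration; not the real IMDb markup
toy_html <- '<html>
  <head><title>Toy chart</title></head>
  <body>
    <h1 class="header">Top Rated Movies</h1>
    <p>Two example entries follow.</p>
  </body>
</html>'

# read_html() parses a literal string just as it parses a URL
toy_page <- xml2::read_html(toy_html)

# The result is an xml_document; its children are the top-level nodes
class(toy_page)
xml2::xml_name(xml2::xml_children(toy_page))
```

Working on such a snippet first makes it easier to see what each later selector returns before applying it to a full page.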

Explore contents

Website contents do not come in a tidy tabular form, so we need to access them via HTML nodes and CSS selectors. The actual content is written in HTML, whereas CSS controls its styling; a CSS selector is simply a pattern that picks out HTML elements by tag, class, or id.

HTML nodes can be found via the developer tab (F12) in Google Chrome or using helper tools like SelectorGadget (including video tutorial).

Let’s see what our IMDb website looks like using the developer tab. Chrome offers a feature that lets us hover over the website and display the corresponding source code section. We inspect the title element “Top Rated Movies”:

The title element is apparently of class header. We obtain the same result using SelectorGadget (see pane in bottom-right corner):
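Selecting by class can be tried out in R directly. The sketch below assumes a made-up snippet with two headings (the class name "other" is invented for contrast) and compares a class selector with a tag selector:

```r
library(xml2)
library(rvest)

# Hypothetical snippet mimicking a title element of class "header"
toy_page <- xml2::read_html(
  '<html><body>
     <h1 class="header">Top Rated Movies</h1>
     <h1 class="other">Something else</h1>
   </body></html>'
)

# ".header" matches by class, "h1" matches by tag name
by_class <- rvest::html_text(rvest::html_nodes(toy_page, ".header"))
by_tag   <- rvest::html_text(rvest::html_nodes(toy_page, "h1"))

by_class        # only the element of class "header"
length(by_tag)  # both h1 elements
```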

Get movie titles

Using SelectorGadget once more, we find that the movie titles are matched by the selector .titleColumn a, that is, by a (link) elements nested inside elements of class titleColumn (note that SelectorGadget separates selectors by comma):

Finding the right selector can take a while if the page structure is deeply nested, so expect some trial and error. Once we have it, rvest makes the rest rather convenient:

title_data <- rvest::html_nodes(url, ".titleColumn a") %>% 
  rvest::html_text()

head(title_data)

## [1] "Die Verurteilten"       "Der Pate"               "Der Pate 2"            
## [4] "The Dark Knight"        "Die zwölf Geschworenen" "Schindlers Liste"
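The space in .titleColumn a makes it a descendant selector: it keeps only links that sit inside a .titleColumn element. A small sketch on made-up markup (the href values are placeholders, not real IMDb paths) shows the difference from a bare a selector, and also how html_attr() retrieves an attribute instead of the text:

```r
library(xml2)
library(rvest)

# Hypothetical table cells; hrefs are placeholders, not real IMDb paths
toy_page <- xml2::read_html(
  '<html><body>
     <a href="/home">Home</a>
     <table>
       <tr><td class="titleColumn">1. <a href="/title/example-1/">Movie A</a>
           <span class="secondaryInfo">(1994)</span></td></tr>
       <tr><td class="titleColumn">2. <a href="/title/example-2/">Movie B</a>
           <span class="secondaryInfo">(1972)</span></td></tr>
     </table>
   </body></html>'
)

all_links   <- rvest::html_nodes(toy_page, "a")               # every link
title_links <- rvest::html_nodes(toy_page, ".titleColumn a")  # links in title cells

length(all_links)
rvest::html_text(title_links)          # link text
rvest::html_attr(title_links, "href")  # link targets
```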

Get release years

In order to scrape the movies’ year of release, we check again for the corresponding selector, this time using the developer pane:

year_data <- rvest::html_nodes(url, ".secondaryInfo") %>% 
  rvest::html_text()

head(year_data)

## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)"

Most of the time, we will need to clean the data after scraping. We extract the years without parentheses (more on regular expressions coming soon) and convert to numeric:

(year_data_clean <- year_data %>% 
  stringr::str_replace_all(pattern = "[\\(\\)]", replacement = "") %>% 
  as.numeric())

##   [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 2002 1980
##  [16] 1999 1990 1975 1954 1995 1997 2002 1991 1946 1998 1977 1999 2001 2014 2019
##  [31] 1994 1962 1994 1995 2002 1991 1985 1998 1936 2000 1960 2006 1931 2014 2011
##  [46] 1988 2006 1968 1942 1988 1954 1979 1979 2000 1940 1981 2012 2006 1957 2008
##  [61] 2020 2019 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984 2016 1986 2017
##  [76] 2019 2018 1999 1995 1963 1995 1981 2009 1984 2009 1997 1983 2007 1992 1968
##  [91] 2000 2012 1958 1931 2004 1941 2016 1985 1952 1921 1948 1987 1952 2000 1959
## [106] 1983 1971 2019 2010 1976 2011 2010 1973 1962 2001 1927 1960 1965 2020 1944
## [121] 1962 2009 1989 1995 1997 2018 1988 2005 1961 1950 1975 2004 1997 1985 1992
## [136] 1959 2004 1950 1995 2013 2001 1963 2006 2007 2009 1961 1998 1980 1988 1948
## [151] 1954 2010 2017 1925 2005 1974 2007 2005 2015 1982 1980 1957 1999 2011 1993
## [166] 2019 1996 1998 1939 2003 2003 1979 2003 1957 1982 1996 1957 2014 1953 2008
## [181] 2015 1949 1954 1978 1993 1995 2014 2009 2002 2014 2016 2013 1966 1924 2018
## [196] 1998 1975 1942 1926 2019 2010 2013 1996 1939 1978 2015 1989 2004 2011 1959
## [211] 1986 1976 2009 2016 1967 2017 1971 1986 1953 2013 2007 1995 1959 1928 1979
## [226] 2015 2004 2012 2000 2001 1976 1966 1984 2020 1940 2004 2000 1988 2006 1955
## [241] 1984 2018 2013 2019 1934 2013 2015 1966 2000 2004
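As an alternative to deleting the parentheses, the digits can be extracted directly. The sketch below reuses the first three values from the output above; year_raw and year_int are illustrative names, and as.integer() is a stylistic choice rather than a requirement:

```r
library(stringr)

year_raw <- c("(1994)", "(1972)", "(1974)")  # first values from the output above

# Without cleaning, the conversion fails and yields NA (with a warning)
suppressWarnings(as.numeric(year_raw))

# Alternative: pull out the run of four digits, then convert
year_int <- as.integer(stringr::str_extract(year_raw, "\\d{4}"))
year_int
```

Extracting what we want, rather than removing what we do not want, tends to be more robust if the surrounding text ever contains extra characters.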

Get rankings

Now that you know the selector drill, let’s proceed to the movies’ rankings. The rank number is plain text inside .titleColumn and has no element (and hence no selector) of its own, so we extract the full text of each cell and parse the number out afterwards.

rank_data <- rvest::html_nodes(url, ".titleColumn") %>% 
  rvest::html_text()

head(rank_data)

## [1] "\n      1.\n      Die Verurteilten\n        (1994)\n    "      
## [2] "\n      2.\n      Der Pate\n        (1972)\n    "              
## [3] "\n      3.\n      Der Pate 2\n        (1974)\n    "            
## [4] "\n      4.\n      The Dark Knight\n        (2008)\n    "       
## [5] "\n      5.\n      Die zwölf Geschworenen\n        (1957)\n    "
## [6] "\n      6.\n      Schindlers Liste\n        (1993)\n    "

We clean the extracted ranking data and convert it to numeric, by first removing line breaks and surplus whitespace…

rank_data_clean <- rank_data %>% 
  stringr::str_replace_all(pattern = "\n", replacement = "") %>%
  stringr::str_squish()

head(rank_data_clean)

## [1] "1. Die Verurteilten (1994)"       "2. Der Pate (1972)"              
## [3] "3. Der Pate 2 (1974)"             "4. The Dark Knight (2008)"       
## [5] "5. Die zwölf Geschworenen (1957)" "6. Schindlers Liste (1993)"
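Since str_squish() treats line breaks as whitespace, the explicit removal of "\n" is not strictly needed. A short check on two of the raw strings shown above (the names raw, two_step and one_step are illustrative):

```r
library(stringr)

# Two raw entries copied from the output above
raw <- c("\n      1.\n      Die Verurteilten\n        (1994)\n    ",
         "\n      2.\n      Der Pate\n        (1972)\n    ")

# Remove line breaks first, then squish (as in the text)
two_step <- stringr::str_squish(stringr::str_replace_all(raw, "\n", ""))

# Squish only: trims both ends and collapses every whitespace run
one_step <- stringr::str_squish(raw)

identical(two_step, one_step)
```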

… and then extracting, for each entry, the number before the dot:

(rank_data_clean <- sapply(
  rank_data_clean,
  function(i) unlist(stringr::str_split(i, pattern = "\\."))[1]) %>% 
  as.numeric())

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
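The same extraction can also be written without sapply(), because stringr functions are vectorised. The sketch below is one possible alternative, not the approach used above: str_extract() takes the leading digits, and str_match() with capture groups splits rank, title and year in one pass. The regular expression assumes every entry has the form "rank. title (year)":

```r
library(stringr)

# Cleaned entries as shown in the output above
entries <- c("1. Die Verurteilten (1994)",
             "2. Der Pate (1972)",
             "3. Der Pate 2 (1974)")

# Leading digits only
rank_alt <- as.numeric(stringr::str_extract(entries, "^\\d+"))

# Capture groups: (rank). (title) ((year)); column 1 holds the full match
parts <- stringr::str_match(entries, "^(\\d+)\\.\\s+(.*)\\s+\\((\\d{4})\\)$")

parsed <- data.frame(
  ranking = as.numeric(parts[, 2]),
  title   = parts[, 3],
  year    = as.numeric(parts[, 4]),
  stringsAsFactors = FALSE
)
parsed
```

Entries that do not fit the assumed pattern would come back as NA from str_match(), which makes malformed rows easy to spot.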

Get ratings

The ratings are easier to handle, because each one sits in its own strong element and needs no further cleaning:

rating_data <- rvest::html_nodes(url, "strong") %>% 
  rvest::html_text()

head(rating_data)

## [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9"

Convert ratings to numeric:

(rating_data_clean <- rating_data %>% 
  as.numeric())

##   [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.6 8.6 8.6
##  [19] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
##  [37] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4
##  [55] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3
##  [73] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [109] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [127] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [145] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [163] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [199] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [217] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
## [235] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
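A bare tag selector such as strong matches every strong element on a page, so it only works while ratings are the sole bold text. The sketch below illustrates the risk on made-up markup; the class name ratingColumn is an assumption for illustration and is not taken from the page inspected above:

```r
library(xml2)
library(rvest)

# Hypothetical page: the class name "ratingColumn" is assumed for illustration
toy_page <- xml2::read_html(
  '<html><body>
     <p><strong>Note:</strong> rankings change over time.</p>
     <table>
       <tr><td class="ratingColumn"><strong>9.2</strong></td></tr>
       <tr><td class="ratingColumn"><strong>9.1</strong></td></tr>
     </table>
   </body></html>'
)

broad  <- rvest::html_text(rvest::html_nodes(toy_page, "strong"))
scoped <- rvest::html_text(rvest::html_nodes(toy_page, ".ratingColumn strong"))

suppressWarnings(as.numeric(broad))  # the unrelated "Note:" turns into NA
as.numeric(scoped)                   # only the ratings remain
```

Checking the converted vector for NA values and for the expected length is a cheap way to notice when a selector picks up too much or too little.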

Store scraped data

Now we have everything we need and can store the data in an R-friendly object.

We will use a data.frame here. If you need to handle large data, you might want to consider using data.table instead, a powerful and very fast package for general data handling (its syntax is a bit different though).

imdb_scraped_data <- data.frame(
  ranking = rank_data_clean,
  title = title_data,
  year = year_data_clean,
  rating = rating_data_clean)

head(imdb_scraped_data)

##   ranking                  title year rating
## 1       1       Die Verurteilten 1994    9.2
## 2       2               Der Pate 1972    9.1
## 3       3             Der Pate 2 1974    9.0
## 4       4        The Dark Knight 2008    9.0
## 5       5 Die zwölf Geschworenen 1957    8.9
## 6       6       Schindlers Liste 1993    8.9
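Before combining scraped vectors, it can be worth confirming that they all have the same length, since data.frame() silently recycles a shorter vector when the lengths divide evenly. A minimal sketch with toy vectors (lengths_ok is a hypothetical helper, not part of any package):

```r
ranking <- 1:4
rating  <- c(9.2, 9.1)  # deliberately too short

# No error here: rating is silently recycled to four rows
recycled <- data.frame(ranking = ranking, rating = rating)
nrow(recycled)

# Hypothetical helper: TRUE only if all inputs share one length
lengths_ok <- function(...) length(unique(lengths(list(...)))) == 1

lengths_ok(ranking, rating)
lengths_ok(1:3, c("a", "b", "c"))
```

In a scraping script, a call such as stopifnot(lengths_ok(...)) placed just before data.frame() would stop the run as soon as one selector misses an entry.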

Nice, now we can use the movie data just like any other R object.
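As a brief illustration of such use, the sketch below rebuilds the six rows shown above by hand (so it runs without scraping) and then filters and aggregates them with base R; the names movies, decade and by_decade are illustrative:

```r
# The six rows printed above, entered by hand so the example is self-contained
movies <- data.frame(
  ranking = 1:6,
  title   = c("Die Verurteilten", "Der Pate", "Der Pate 2",
              "The Dark Knight", "Die zwölf Geschworenen", "Schindlers Liste"),
  year    = c(1994, 1972, 1974, 2008, 1957, 1993),
  rating  = c(9.2, 9.1, 9.0, 9.0, 8.9, 8.9),
  stringsAsFactors = FALSE
)

# Derive the decade and average the rating within each decade
movies$decade <- floor(movies$year / 10) * 10
by_decade <- aggregate(rating ~ decade, data = movies, FUN = mean)
by_decade

# Ordinary subsetting works as well: movies released before 1980
older <- movies[movies$year < 1980, c("title", "year")]
older
```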