Scraping Twitter Data

Scraping Twitter

This demo shows how to scrape data from Twitter with R.

We retrieve Twitter content via different queries and explore it with some basic analytic tools.

# Load required packages

library(rtweet)
library(stringr)
library(tidytext)
library(tidyverse)

Set up API access

Before running the code below, we need API credentials; we defined ours in the background here. Note that you need to define your personal consumer_key, consumer_secret, access_token, and access_secret accordingly.

personal_token <- rtweet::create_token(
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  access_secret = access_secret)
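
One way to keep the four credentials out of the script is to read them from environment variables. The sketch below uses base R only. The helper read_credential and the variable name TWITTER_CONSUMER_KEY are our own illustrative choices, not something rtweet prescribes.

```r
# Helper (hypothetical, not part of rtweet): fetch one credential from the
# environment and fail early with a clear message if it is missing.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set.", call. = FALSE)
  }
  value
}

# For illustration only: set a dummy value so the sketch runs end to end.
# In practice the variable would be defined outside the script, e.g. in ~/.Renviron.
Sys.setenv(TWITTER_CONSUMER_KEY = "dummy-key")

consumer_key <- read_credential("TWITTER_CONSUMER_KEY")
```

The same helper could be called once per credential before passing the values to create_token().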

Extract relevant tweets

We search for tweets with the hashtag #merkel. rtweet conveniently returns a data.frame object, to which we add a query column:

scraped_tweets_merkel <- rtweet::search_tweets(
  "#merkel", # query string: quote a phrase for an exact match, use # for hashtags and @ for mentions
  n = 1000, # cap to 1k hits (max: 18k per 15 minutes)
  lang = "de", # restrict to tweets in German
  include_rts = FALSE, # exclude retweets
  type = "recent", # alternatively, "popular" or "mixed"
  token = personal_token)

scraped_tweets_merkel$query <- "Merkel"

Now, we search for tweets with the hashtag #AfD:

scraped_tweets_afd <- rtweet::search_tweets(
  "#AfD",
  n = 1000,
  lang = "de", 
  include_rts = FALSE, 
  type = "recent",
  token = personal_token)

scraped_tweets_afd$query <- "AfD"

Combine both:

combined_tweets <- dplyr::bind_rows(scraped_tweets_merkel, scraped_tweets_afd)

head(combined_tweets)

## # A tibble: 6 x 91
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 120252~ 13873014~ 2021-04-28 07:03:27 ArdillaCal~ "@Jo~ Twitt~
## 2 128568~ 13873009~ 2021-04-28 07:01:32 ding_lebin  "Ver~ Twitt~
## 3 124963~ 13873001~ 2021-04-28 06:58:25 Dr_hc_Jasc~ "Auc~ Twitt~
## 4 431065~ 13872996~ 2021-04-28 06:56:18 zukunft37   "Zur~ Twitt~
## 5 431065~ 13869572~ 2021-04-27 08:15:47 zukunft37   "Fra~ Twitt~
## 6 431065~ 13870315~ 2021-04-27 13:11:11 zukunft37   "Hie~ Twitt~
## # ... with 85 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>, query <chr>
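
Because a tweet can carry both hashtags, the same status_id may show up in both result sets. A minimal sketch of how one might check for and drop such duplicates, using a small made-up data frame in place of combined_tweets (all IDs and texts are invented for illustration):

```r
# Toy stand-in for combined_tweets; all values are invented.
toy_tweets <- data.frame(
  status_id = c("101", "102", "103", "102"),
  text = c("#merkel only", "#merkel and #AfD", "#AfD only", "#merkel and #AfD"),
  query = c("Merkel", "Merkel", "AfD", "AfD"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat a tweet already returned by an earlier query
n_duplicates <- sum(duplicated(toy_tweets$status_id))

# Keep the first occurrence of each tweet; it retains the label of the
# first query that returned it.
unique_tweets <- toy_tweets[!duplicated(toy_tweets$status_id), ]
```

Whether to deduplicate depends on the analysis: for per-query comparisons, keeping a tweet under both labels may be exactly what is wanted.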

We can also conduct multiple independent search queries. The following example gives us a list, each of whose entries is a data.frame as in the previous examples. To combine the entries into a single data.frame, we call rbind on the list elements afterwards:

# Combine multiple search queries

tweets_multiple <- lapply(
  c("#afd", "#corona OR #merkel"),
  function(i) {rtweet::search_tweets(
    q = i, 
    n = 100,
    lang = "de", 
    token = personal_token)})

# Use rtweet's wrapper around do.call(rbind, ...)

tweets_multiple_df <- rtweet::do_call_rbind(tweets_multiple) 

head(tweets_multiple_df)

## # A tibble: 6 x 90
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 133082~ 13873026~ 2021-04-28 07:08:12 AfdAachen   "Die~ Twitt~
## 2 179629~ 13873025~ 2021-04-28 07:07:48 MaddyLina25 "Auc~ Twitt~
## 3 108529~ 13873024~ 2021-04-28 07:07:40 LenaSturmT~ "Es ~ Twitt~
## 4 108529~ 13872974~ 2021-04-28 06:47:37 LenaSturmT~ "„An~ Twitt~
## 5 124413~ 13873022~ 2021-04-28 07:06:55 Macaveli85  "@De~ Twitt~
## 6 130766~ 13873020~ 2021-04-28 07:06:00 aa24twa     "Sin~ Twitt~
## # ... with 84 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>
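
Note that the combined data frame has no query column, so it no longer records which query produced which row. A sketch of one way to keep that information, using base R and mocked results in place of real search output so that it runs without API access (the data values are invented):

```r
queries <- c("#afd", "#corona OR #merkel")

# Mocked results: one small data frame per query (values are invented).
mock_results <- list(
  data.frame(status_id = c("1", "2"), text = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(status_id = "3", text = "c", stringsAsFactors = FALSE)
)

# Attach the originating query to each result, then stack them row-wise.
labelled <- Map(
  function(result, query) {
    result$query <- query
    result
  },
  mock_results, queries
)
tweets_labelled_df <- do.call(rbind, labelled)
```

With real data, mock_results would be the list returned by the lapply() call.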

We can also retrieve the timeline of a specific user:

scraped_tweets_lauterbach <- rtweet::get_timeline(
  "Karl_Lauterbach",
  n = 100,
  include_rts = FALSE, # exclude retweets
  # lang and type are search parameters; the timeline endpoint does not use them
  token = personal_token)

head(scraped_tweets_lauterbach)

## # A tibble: 6 x 90
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 329298~ 13871918~ 2021-04-27 23:48:14 Karl_Laute~ Förd~ Twitt~
## 2 329298~ 13871882~ 2021-04-27 23:33:38 Karl_Laute~ Für ~ Twitt~
## 3 329298~ 13871834~ 2021-04-27 23:14:39 Karl_Laute~ Ich ~ Twitt~
## 4 329298~ 13864672~ 2021-04-25 23:48:40 Karl_Laute~ Sehe~ Twitt~
## 5 329298~ 13864492~ 2021-04-25 22:37:04 Karl_Laute~ (3) ~ Twitt~
## 6 329298~ 13864483~ 2021-04-25 22:33:34 Karl_Laute~ (2) ~ Twitt~
## # ... with 84 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

rtweet provides some other useful functionality. For instance, analyzing tweets might require extracting emojis, for which rtweet ships a lookup table of emoji codes and descriptions:

head(rtweet::emojis)

## # A tibble: 6 x 2
##   code         description                    
##   <chr>        <chr>                          
## 1 "\U0001f600" grinning face                  
## 2 "\U0001f601" beaming face with smiling eyes 
## 3 "\U0001f602" face with tears of joy         
## 4 "\U0001f923" rolling on the floor laughing  
## 5 "\U0001f603" grinning face with big eyes    
## 6 "\U0001f604" grinning face with smiling eyes
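
Such a lookup table can be matched against tweet texts to count emoji usage. A minimal base R sketch, assuming a toy lookup with the same two columns (code, description) and a few invented texts; count_occurrences is a hypothetical helper, not an rtweet function:

```r
# Toy lookup table with the same two columns as rtweet::emojis.
emoji_lookup <- data.frame(
  code = c("\U0001f600", "\U0001f602"),
  description = c("grinning face", "face with tears of joy"),
  stringsAsFactors = FALSE
)

# Invented example texts
texts <- c("Guten Morgen \U0001f600", "Haha \U0001f602\U0001f602", "kein Emoji")

# Count literal (non-regex) occurrences of one pattern across all texts.
count_occurrences <- function(pattern, x) {
  matches <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr() returns -1 where nothing matched, so count positive positions only
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

emoji_lookup$n <- vapply(
  emoji_lookup$code, count_occurrences, integer(1),
  x = texts, USE.NAMES = FALSE
)
```

With real data, texts would be the text column of the scraped tweets and the lookup would be rtweet::emojis itself.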

Explore scraped data

Exploring the data we have just scraped, we can, for instance, analyze the frequency of tweets over time:

summary(combined_tweets$created_at)

##                  Min.               1st Qu.                Median 
## "2021-04-26 09:58:12" "2021-04-26 18:54:25" "2021-04-27 09:29:39" 
##                  Mean               3rd Qu.                  Max. 
## "2021-04-27 07:15:33" "2021-04-27 16:21:21" "2021-04-28 07:06:55"

combined_tweets %>%
  dplyr::group_by(query) %>%
  rtweet::ts_plot("hours", cex = 0.8) +
  labs(
    x = NULL, 
    y = NULL,
    title = "Frequency of tweets",
    caption = "Data collected from Twitter's REST API via rtweet") +
  theme_minimal()
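
ts_plot() aggregates the tweets into time intervals before plotting. To inspect such counts directly, hourly counts per query could be computed along these lines (a base R sketch on invented timestamps, not rtweet's own implementation):

```r
# Toy stand-in for combined_tweets; timestamps and labels are invented.
toy_tweets <- data.frame(
  created_at = as.POSIXct(
    c("2021-04-27 09:05:00", "2021-04-27 09:40:00",
      "2021-04-27 10:15:00", "2021-04-27 09:30:00"),
    tz = "UTC"
  ),
  query = c("Merkel", "Merkel", "Merkel", "AfD"),
  stringsAsFactors = FALSE
)

# Truncate each timestamp to the start of its hour
toy_tweets$hour <- format(toy_tweets$created_at, "%Y-%m-%d %H:00", tz = "UTC")

# Cross-tabulate query by hour and reshape to one row per combination
hourly_counts <- as.data.frame(
  table(query = toy_tweets$query, hour = toy_tweets$hour),
  responseName = "n",
  stringsAsFactors = FALSE
)
```

Combinations without any tweets appear with n = 0, which keeps gaps visible when the counts are plotted.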

We find the most retweeted tweet…

most_retweeted_tweet <- combined_tweets %>% 
  dplyr::arrange(-retweet_count) %>%
  dplyr::slice(1) %>% 
  dplyr::select(created_at, screen_name, text, retweet_count, status_id)

print(most_retweeted_tweet$text)

## [1] "Wird auf das Auto von @Karl_Lauterbach ein Farbanschlag verübt: Die Leitmedien empören sich bundesweit. #AfD-Wahlkämpfer werden brutal attackiert: Randnotiz. Finde den Fehler... https://t.co/hBv8JGmhsC"

… and, lastly, the top hashtags:

combined_tweets %>% 
  tidytext::unnest_tokens(hashtag, text, "tweets", to_lower = FALSE) %>%
  dplyr::filter(
    stringr::str_detect(hashtag, "^#")) %>%
  dplyr::count(hashtag) %>%
  dplyr::slice_max(n, n = 5)

## # A tibble: 5 x 2
##   hashtag     n
##   <chr>   <int>
## 1 #Merkel   904
## 2 #AfD      759
## 3 #Corona   117
## 4 #CDU      114
## 5 #AFD       93
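
In this output, #AfD and #AFD are counted separately because to_lower = FALSE preserves case. If case variants should count as one hashtag, they can be normalised before counting. A base R sketch on invented texts; the regular expression is a simplification that only covers ASCII letters, digits, and underscores, so hashtags with umlauts would be cut short:

```r
# Invented example texts
texts <- c("#AfD und #Merkel", "#AFD kritisiert #Merkel", "#afd #Corona")

# Extract all hashtags (simplified pattern, ASCII only)
hashtags <- unlist(regmatches(texts, gregexpr("#[A-Za-z0-9_]+", texts)))

# Lower-case before counting so that case variants are merged
hashtag_counts <- sort(table(tolower(hashtags)), decreasing = TRUE)
```

Here the three spellings of #AfD collapse into a single entry, #afd.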

The standard search API only reaches back about a week and caps the number of tweets per request window, which might be prohibitive depending on the goal of the analysis. Otherwise, rtweet is quite powerful and convenient to handle.

Asmik & Lisa – for Intro to NLP

April/May 2021