This demo is about scraping data from Twitter.
We show how to retrieve tweets in R through different kinds of queries against Twitter's API, and how to explore the results with some basic analytic tools.
# Load required packages
library(rtweet)
library(stringr)
library(tidytext)
library(tidyverse)
Before running the code below, we need to set up an API token; we did this in the background here. Note that you need to define your own consumer_key, consumer_secret, access_token, and access_secret accordingly.
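One way to define these four values without writing them into the script is to read them from environment variables. The following is a minimal sketch in base R; the environment variable names and the helper read_credentials() are assumptions made for illustration, not something rtweet prescribes.

```r
# Hypothetical environment variable names; adjust them to your own setup.
credential_vars <- c(
  consumer_key    = "TWITTER_CONSUMER_KEY",
  consumer_secret = "TWITTER_CONSUMER_SECRET",
  access_token    = "TWITTER_ACCESS_TOKEN",
  access_secret   = "TWITTER_ACCESS_SECRET")

# Read all four values and fail early if any of them is not set.
read_credentials <- function(vars = credential_vars) {
  values <- Sys.getenv(vars, unset = "")
  names(values) <- names(vars)
  missing <- names(values)[!nzchar(values)]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}

# Usage (requires the environment variables to be set):
# creds <- read_credentials()
# consumer_key <- creds$consumer_key
```

The remaining three values can be assigned from creds in the same way before the token is created.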
personal_token <- rtweet::create_token(
consumer_key = consumer_key,
consumer_secret = consumer_secret,
access_token = access_token,
access_secret = access_secret)
We search for tweets with the hashtag #merkel. rtweet conveniently returns a data.frame object, to which we add a query column:
scraped_tweets_merkel <- rtweet::search_tweets(
"#merkel", # search query: # for hashtags, @ for mentions
n = 1000, # cap at 1,000 tweets (rate limit: 18,000 per 15 minutes)
lang = "de", # restrict results to tweets in German
include_rts = FALSE, # exclude retweets
type = "recent", # alternatively, "popular" or "mixed"
token = personal_token)
scraped_tweets_merkel$query <- "Merkel"
Now, search for tweets with hashtag #AfD:
scraped_tweets_afd <- rtweet::search_tweets(
"#AfD",
n = 1000,
lang = "de",
include_rts = FALSE,
type = "recent",
token = personal_token)
scraped_tweets_afd$query <- "AfD"
Combine both:
combined_tweets <- dplyr::bind_rows(scraped_tweets_merkel, scraped_tweets_afd)
head(combined_tweets)
## # A tibble: 6 x 91
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 120252~ 13873014~ 2021-04-28 07:03:27 ArdillaCal~ "@Jo~ Twitt~
## 2 128568~ 13873009~ 2021-04-28 07:01:32 ding_lebin "Ver~ Twitt~
## 3 124963~ 13873001~ 2021-04-28 06:58:25 Dr_hc_Jasc~ "Auc~ Twitt~
## 4 431065~ 13872996~ 2021-04-28 06:56:18 zukunft37 "Zur~ Twitt~
## 5 431065~ 13869572~ 2021-04-27 08:15:47 zukunft37 "Fra~ Twitt~
## 6 431065~ 13870315~ 2021-04-27 13:11:11 zukunft37 "Hie~ Twitt~
## # ... with 85 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## # lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>, query <chr>
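A tweet that carries both hashtags is returned by both searches, so the combined table can contain the same status_id twice, once per query. The sketch below uses made-up data with only three of the columns to show one way of checking for this; the toy_ objects are hypothetical stand-ins, not the scraped data.

```r
# Made-up stand-ins for the two search results.
toy_merkel <- data.frame(
  status_id = c("1", "2", "3"),
  text = c("#merkel a", "#merkel #AfD b", "#merkel c"),
  stringsAsFactors = FALSE)
toy_afd <- data.frame(
  status_id = c("2", "4"),
  text = c("#merkel #AfD b", "#AfD d"),
  stringsAsFactors = FALSE)

# Label each result set with its query and stack them.
toy_merkel$query <- "Merkel"
toy_afd$query <- "AfD"
toy_combined <- rbind(toy_merkel, toy_afd)

# Tweets returned by more than one query share a status_id.
shared_ids <- unique(
  toy_combined$status_id[duplicated(toy_combined$status_id)])

# Keep only the first occurrence of each tweet.
toy_unique <- toy_combined[!duplicated(toy_combined$status_id), ]
```

Whether to drop such duplicates depends on the question: for tweet-level statistics each tweet should probably count once, whereas for per-query comparisons it can make sense to keep both rows.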
We can also run several independent search queries. The following example gives us a list, each of whose entries is a data.frame as in the previous examples. To combine the entries into a single data.frame, we call rbind on the list elements afterwards:
# Combine multiple search queries
tweets_multiple <- lapply(
c("#afd", "#corona OR #merkel"),
function(i) {rtweet::search_tweets(
q = i,
n = 100,
lang = "de",
token = personal_token)})
# Use rtweet's wrapper around do.call(rbind, ...)
tweets_multiple_df <- rtweet::do_call_rbind(tweets_multiple)
head(tweets_multiple_df)
## # A tibble: 6 x 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 133082~ 13873026~ 2021-04-28 07:08:12 AfdAachen "Die~ Twitt~
## 2 179629~ 13873025~ 2021-04-28 07:07:48 MaddyLina25 "Auc~ Twitt~
## 3 108529~ 13873024~ 2021-04-28 07:07:40 LenaSturmT~ "Es ~ Twitt~
## 4 108529~ 13872974~ 2021-04-28 06:47:37 LenaSturmT~ "„An~ Twitt~
## 5 124413~ 13873022~ 2021-04-28 07:06:55 Macaveli85 "@De~ Twitt~
## 6 130766~ 13873020~ 2021-04-28 07:06:00 aa24twa "Sin~ Twitt~
## # ... with 84 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## # lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
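Note that the result has 90 columns: unlike combined_tweets, it does not record which query produced which row. The sketch below shows, on made-up data, how a list of data.frames can be labelled and then stacked. It uses plain do.call(rbind, ...) in place of rtweet's wrapper, and the toy_ objects are hypothetical.

```r
# Made-up stand-in for a list of search results, named by query.
toy_results <- list(
  "#afd" = data.frame(
    status_id = c("1", "2"), text = c("a", "b"),
    stringsAsFactors = FALSE),
  "#corona OR #merkel" = data.frame(
    status_id = "3", text = "c",
    stringsAsFactors = FALSE))

# Add the query string as a column to each list entry.
toy_labelled <- Map(
  function(df, q) {
    df$query <- q
    df
  },
  toy_results, names(toy_results))

# Stack all entries row-wise; every entry needs the same column names.
toy_df <- do.call(rbind, toy_labelled)
rownames(toy_df) <- NULL
```

With real results, the names could be set from the vector of query strings before labelling, for example via names(tweets_multiple) <- c("#afd", "#corona OR #merkel").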
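We can also retrieve the most recent tweets of a specific user:
scraped_tweets_lauterbach <- rtweet::get_timeline(
"Karl_Lauterbach", # screen name of the account
n = 100, # number of most recent tweets to request
include_rts = FALSE, # exclude retweets (passed on to the API)
token = personal_token)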
head(scraped_tweets_lauterbach)
## # A tibble: 6 x 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 329298~ 13871918~ 2021-04-27 23:48:14 Karl_Laute~ Förd~ Twitt~
## 2 329298~ 13871882~ 2021-04-27 23:33:38 Karl_Laute~ Für ~ Twitt~
## 3 329298~ 13871834~ 2021-04-27 23:14:39 Karl_Laute~ Ich ~ Twitt~
## 4 329298~ 13864672~ 2021-04-25 23:48:40 Karl_Laute~ Sehe~ Twitt~
## 5 329298~ 13864492~ 2021-04-25 22:37:04 Karl_Laute~ (3) ~ Twitt~
## 6 329298~ 13864483~ 2021-04-25 22:33:34 Karl_Laute~ (2) ~ Twitt~
## # ... with 84 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## # lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
rtweet provides some other useful functionality. For instance, analyzing tweets might require extracting emojis, for which rtweet ships a lookup table:
head(rtweet::emojis)
## # A tibble: 6 x 2
## code description
## <chr> <chr>
## 1 "\U0001f600" grinning face
## 2 "\U0001f601" beaming face with smiling eyes
## 3 "\U0001f602" face with tears of joy
## 4 "\U0001f923" rolling on the floor laughing
## 5 "\U0001f603" grinning face with big eyes
## 6 "\U0001f604" grinning face with smiling eyes
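A lookup table like this can be matched against tweet texts to find out which emojis they contain. The following base R sketch assumes a three-row stand-in with the same two columns as rtweet::emojis; find_emojis() and the toy_ objects are hypothetical, and emojis built from several code points may need more careful handling than this simple substring match.

```r
# Hypothetical three-row stand-in with the columns of rtweet::emojis.
toy_emojis <- data.frame(
  code = c("\U0001f600", "\U0001f602", "\U0001f923"),
  description = c("grinning face", "face with tears of joy",
                  "rolling on the floor laughing"),
  stringsAsFactors = FALSE)

# Return the descriptions of all lookup-table emojis found in one text.
find_emojis <- function(text, lookup = toy_emojis) {
  present <- vapply(
    lookup$code,
    function(code) grepl(code, text, fixed = TRUE),
    logical(1))
  lookup$description[present]
}

toy_texts <- c("Guten Morgen \U0001f600", "Kein Emoji hier",
               "\U0001f602\U0001f923")
emojis_per_text <- lapply(toy_texts, find_emojis)
```

Passing lookup = rtweet::emojis and a column of tweet texts should work the same way, at the cost of one substring search per emoji and tweet.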
Exploring the data we have just scraped, we can, for instance, analyze the frequency of tweets over time:
summary(combined_tweets$created_at)
## Min. 1st Qu. Median
## "2021-04-26 09:58:12" "2021-04-26 18:54:25" "2021-04-27 09:29:39"
## Mean 3rd Qu. Max.
## "2021-04-27 07:15:33" "2021-04-27 16:21:21" "2021-04-28 07:06:55"
combined_tweets %>%
dplyr::group_by(query) %>%
rtweet::ts_plot("hours", cex = 0.8) +
labs(
x = NULL,
y = NULL,
title = "Frequency of tweets",
caption = "Data collected from Twitter's REST API via rtweet") +
theme_minimal()
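The plot is based on the number of tweets per hour and query. As a rough illustration of that aggregation step, the sketch below computes such counts by hand on made-up timestamps in base R; it is not how ts_plot() is implemented, and the toy_tweets data is hypothetical.

```r
# Made-up timestamps and query labels.
toy_tweets <- data.frame(
  created_at = as.POSIXct(
    c("2021-04-27 09:05:00", "2021-04-27 09:40:00",
      "2021-04-27 10:15:00", "2021-04-27 09:30:00"),
    tz = "UTC"),
  query = c("Merkel", "Merkel", "Merkel", "AfD"),
  stringsAsFactors = FALSE)

# Truncate each timestamp to the start of its hour.
toy_tweets$hour <- format(
  toy_tweets$created_at, "%Y-%m-%d %H:00", tz = "UTC")

# Cross-tabulate query by hour; combinations without tweets get n = 0.
hourly_counts <- as.data.frame(
  table(query = toy_tweets$query, hour = toy_tweets$hour),
  responseName = "n",
  stringsAsFactors = FALSE)
```

The resulting table has one row per query and hour, which is the long format that plotting functions typically expect.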
We find the most retweeted tweet…
most_retweeted_tweet <- combined_tweets %>%
dplyr::arrange(-retweet_count) %>%
dplyr::slice(1) %>%
dplyr::select(created_at, screen_name, text, retweet_count, status_id)
print(most_retweeted_tweet$text)
## [1] "Wird auf das Auto von @Karl_Lauterbach ein Farbanschlag verübt: Die Leitmedien empören sich bundesweit. #AfD-Wahlkämpfer werden brutal attackiert: Randnotiz. Finde den Fehler... https://t.co/hBv8JGmhsC"
… and, lastly, the top hashtags:
combined_tweets %>%
tidytext::unnest_tokens(hashtag, text, "tweets", to_lower = FALSE) %>%
dplyr::filter(
stringr::str_detect(hashtag, "^#")) %>%
dplyr::count(hashtag) %>%
dplyr::slice_max(n, n = 5)
## # A tibble: 5 x 2
## hashtag n
## <chr> <int>
## 1 #Merkel 904
## 2 #AfD 759
## 3 #Corona 117
## 4 #CDU 114
## 5 #AFD 93
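Because the hashtags are kept in their original case, #AfD and #AFD are counted as two different hashtags. If spelling variants should be pooled, the hashtags can be lower-cased before counting. The sketch below illustrates this on made-up texts in base R; the simple regular expression is an assumption that stands in for the tokenizer and only covers ASCII hashtags.

```r
# Made-up tweet texts with differently cased hashtags.
toy_texts <- c("#Merkel und die #AfD", "Die #AFD zu #Corona",
               "#merkel heute")

# Extract every "#" followed by letters, digits or underscores.
hashtags <- unlist(
  regmatches(toy_texts, gregexpr("#[[:alnum:]_]+", toy_texts)))

# Pool spelling variants by lower-casing, then count and sort.
hashtag_counts <- sort(table(tolower(hashtags)), decreasing = TRUE)
```

In the tidytext pipeline, a similar effect can probably be achieved by applying tolower() to the hashtag column before the dplyr::count() step.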
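The limitations of the standard search API (only tweets from roughly the past week are available, and at most 18,000 tweets can be retrieved per 15-minute window) might be prohibitive, depending on the goal of the analysis. Otherwise, rtweet is quite powerful and convenient to use.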