This demo is about topic modeling with a structural topic model (STM).
We show how to prepare the data for fitting an STM (Roberts et al., 2016), conduct the modeling process, and analyze the results.
# Load required packages
library(data.table)
library(quanteda)
library(stm)
dfm
objectAs in the static feature extraction demo, we convert our data to a dfm
. However, we focus on slightly different tokens this time as we are now interested in identifying topical relations.
First, we get the data (the path
variable again requiring individual adaptation):
# path <- ... (individual file location)
twitter_corpus <- readRDS(sprintf("%s/twitter_corpus.RDS", path))
twitter_corpus
## Corpus consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
## "Habe mir das Gro Ko-Sondierungspapier zu Klima nochmal genau..."
##
## ABaerbock15210084001 :
## "Auch weltweit sieht man, dass Angela Merkel s GroKo 3.0 nich..."
##
## ABaerbock15216341401 :
## "Der groessten globalen Herausforderung, der Klimakrise, 30 S..."
##
## ABaerbock15252349801 :
## "Wir brauchen eine andere Verkehrspolitik - weg von Benzinern..."
##
## ABaerbock15256297201 :
## "Das ist die Leistung von Hunderten Wahlkaempfern und Politik..."
##
## ABaerbock15283779001 :
## "Luise Amtsberg hat heute alles gesagt,was es zum Familiennac..."
##
## [ reached max_ndoc ... 1,209 more documents ]
We furthermore store the document texts in the docvars (which poses a redundancy in the corpus
object but ensures the texts are not lost during conversion for the dfm
).
quanteda::docvars(twitter_corpus)$text <- quanteda::texts(twitter_corpus)
Now let’s tokenize our corpus in a way appropriate for topic modeling.
We assume that topical information is mostly carried by nouns and choose to remove all other words for document representation, attempting to eliminate noise that might distract our topic model. This is, of course, a simplification, but that’s life (of an analyst).
Luckily, nouns are pretty easy to identify by their leading capital letters in German (implicitly relying on MPs’ grammar skills and/or auto-correct, and accepting that this will also include non-nouns from the beginning of a sentence):
# First, tokenize as usual
twitter_tokens <- quanteda::tokens(
twitter_corpus,
what = "word",
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_separators = TRUE,
split_hyphens = TRUE,
include_docvars = TRUE)
# Stem
twitter_tokens <- quanteda::tokens_wordstem(
twitter_tokens,
language = "german")
# Keep only tokens starting with a capital letter
twitter_tokens <- quanteda::tokens_keep(
twitter_tokens,
pattern = c("[:upper:]([:lower:])+"),
valuetype = "regex",
case_insensitive = FALSE)
# Lowercase afterwards
twitter_tokens <- quanteda::tokens_tolower(twitter_tokens)
For stopwords removal we retrace the steps taken in the static feature extraction:
# Remove stopwords using our cleaning function from earlier
stopwords <- clean_and_stem(quanteda::stopwords(language = "de"))
stopwords <- stopwords[nchar(stopwords) > 0]
stopwords <- stringr::str_remove_all(
stopwords,
"kein(.)*|nicht")
stopwords <- c(
stopwords,
c("der", "die", "das", "was", "wer", "wie", "ich", "sie", "wir", "ihr"))
saveRDS(stopwords, sprintf("%s/stopwords.RDS", path))
twitter_tokens <- quanteda::tokens_remove(
twitter_tokens,
pattern = stopwords)
twitter_tokens <- quanteda::tokens_select(twitter_tokens, min_nchar = 3)
# Inspect
twitter_tokens
## Tokens consisting of 1,215 documents and 17 docvars.
## ABaerbock15156546001 :
## [1] "gro" "sondierungspapi" "klima" "krass"
## [5] "kyoto" "protokoll" "meilenstein" "klimadiplomati"
## [9] "frau" "merkel" "nachbess"
##
## ABaerbock15210084001 :
## [1] "angela" "merkel" "groko"
## [4] "paris" "abkomm" "dah"
## [7] "klimasofortprogramm"
##
## ABaerbock15216341401 :
## [1] "herausforder" "klimakris" "sek"
## [4] "std" "regierungserklaer" "leerstell"
## [7] "koalitionsvertrag" "fuess" "klimafolg"
##
## ABaerbock15252349801 :
## [1] "verkehrspolit" "benzin" "diesel" "mobilitaet"
## [5] "frag" "luft" "wohnung" "strass"
##
## ABaerbock15256297201 :
## [1] "leistung" "hundert" "wahlkaempf" "polit" "land"
## [6] "mensch" "konkret" "polit" "kommunalwahl"
##
## ABaerbock15283779001 :
## [1] "luis" "amtsberg" "familiennachzug" "integration"
## [5] "gefluechtet" "perspektiv" "zusammenleb" "famili"
## [9] "grundrecht" "famili" "groko"
##
## [ reached max_ndoc ... 1,209 more documents ]
Looking good, so convert to dfm
:
twitter_dfm <- quanteda::dfm(twitter_tokens)
twitter_dfm
## Document-feature matrix of: 1,215 documents, 4,923 features (99.8% sparse) and 17 docvars.
## features
## docs gro sondierungspapi klima krass kyoto protokoll
## ABaerbock15156546001 1 1 1 1 1 1
## ABaerbock15210084001 0 0 0 0 0 0
## ABaerbock15216341401 0 0 0 0 0 0
## ABaerbock15252349801 0 0 0 0 0 0
## ABaerbock15256297201 0 0 0 0 0 0
## ABaerbock15283779001 0 0 0 0 0 0
## features
## docs meilenstein klimadiplomati frau merkel
## ABaerbock15156546001 1 1 1 1
## ABaerbock15210084001 0 0 0 1
## ABaerbock15216341401 0 0 0 0
## ABaerbock15252349801 0 0 0 0
## ABaerbock15256297201 0 0 0 0
## ABaerbock15283779001 0 0 0 0
## [ reached max_ndoc ... 1,209 more documents, reached max_nfeat ... 4,913 more features ]
Unfortunately, as the above output tells us, our dfm
is extremely sparse – tokens are not at all common across documents, which will pose a challenge to any topic model. This is actually not too surprising when we recall the extreme brevity of our Twitter documents.
Side note: if you run into this problem you might want to consider the option of pooling texts into larger documents (this can be achieved, for instance, calling quanteda::dfm_group
). However, this is obviously a heavy simplification and leads to all documents that share the same grouping variable being assigned the same topic label.
What we will do to alleviate the sparsity problem at least to some extent is to condense the dfm
to the most prominent features:
# Eliminate less common features
twitter_dfm <- quanteda::dfm_select(
twitter_dfm,
names(quanteda::topfeatures(twitter_dfm, n = 300)))
twitter_dfm
## Document-feature matrix of: 1,215 documents, 300 features (98.8% sparse) and 17 docvars.
## features
## docs klima frau merkel groko klimakris diesel frag luft
## ABaerbock15156546001 1 1 1 0 0 0 0 0
## ABaerbock15210084001 0 0 1 1 0 0 0 0
## ABaerbock15216341401 0 0 0 0 1 0 0 0
## ABaerbock15252349801 0 0 0 0 0 1 1 1
## ABaerbock15256297201 0 0 0 0 0 0 0 0
## ABaerbock15283779001 0 0 0 1 0 0 0 0
## features
## docs strass polit
## ABaerbock15156546001 0 0
## ABaerbock15210084001 0 0
## ABaerbock15216341401 0 0
## ABaerbock15252349801 1 0
## ABaerbock15256297201 0 2
## ABaerbock15283779001 0 0
## [ reached max_ndoc ... 1,209 more documents, reached max_nfeat ... 290 more features ]
The special trait of the STM is its making use of s document-level meta data, so we need to specify which information to use here. In the stm
function the corresponding argument is called prevalence
and must be stated as a formula.
We will use the following meta variables (smooth effects defaulting to B-splines):
party
as a categorical variable,bundesland
as a categorical variable,unemployment_rate
as a smooth effect with 5 degrees of freedom, andshare_pop_migration
as a smooth effect with 5 degrees of freedom.The resulting formula, ~ party + bundesland + s(unemployment_rate, df = 5) + s(share_pop_migration, df = 5)
, is perfectly disputable – this decision is up to domain knowledge and some trial-and-error.
prevalence_formula <- as.formula(paste(
"",
"party + bundesland + s(unemployment_rate, df = 5) + s(share_pop_migration, df = 5)",
sep = "~"))
Now we have one thing to pay attention to: the stm
implementation cannot handle missing data in its prevalence variables. We must therefore exclude non-complete observations (alternatively, we could impute the missing values if it made sense and we had a meaningful imputation mechanism):
twitter_dfm <- quanteda::dfm_subset(
twitter_dfm,
!is.na(party) & !is.na(bundesland) & !is.na(unemployment_rate) &
!is.na(share_pop_migration))
twitter_dfm
## Document-feature matrix of: 1,199 documents, 300 features (98.8% sparse) and 17 docvars.
## features
## docs klima frau merkel groko klimakris diesel frag luft
## ABaerbock15156546001 1 1 1 0 0 0 0 0
## ABaerbock15210084001 0 0 1 1 0 0 0 0
## ABaerbock15216341401 0 0 0 0 1 0 0 0
## ABaerbock15252349801 0 0 0 0 0 1 1 1
## ABaerbock15256297201 0 0 0 0 0 0 0 0
## ABaerbock15283779001 0 0 0 1 0 0 0 0
## features
## docs strass polit
## ABaerbock15156546001 0 0
## ABaerbock15210084001 0 0
## ABaerbock15216341401 0 0
## ABaerbock15252349801 1 0
## ABaerbock15256297201 0 2
## ABaerbock15283779001 0 0
## [ reached max_ndoc ... 1,193 more documents, reached max_nfeat ... 290 more features ]
stm
objectWe then convert our dfm
to an stm
object we can use in the structural topic model. This is super simple with quanteda
:
twitter_stm <- quanteda::convert(twitter_dfm, to = "stm")
summary(twitter_stm)
## Length Class Mode
## documents 1158 -none- list
## vocab 300 -none- character
## meta 17 data.frame list
Time to fit the actual model.
While models learn all necessary parameters during training, hyperparameters need to be specified upfront. The most critical hyperparameter here is the number of topics, K. So how to find K?
Topic modeling being an unsupervised task (i.e., we do not have access to the ground truth), the quality of model output must ultimately be judged by human interpretability. We can seek some help from the data: stm
offers a function that searches for the optimal K (still requiring a range of potential values!). Exactly what optimal means can be somewhat steered by the user. We will use the held-out likelihood option which creates an artificial supervised task: we mask a certain part of some of the documents and ask the model to complete them (note that the ground truth is known now). The more predictive power our model has, the higher the likelihood for the held-out words should be (for details see Wallach et al., 2009:
# Conduct hyperparameter search
hyperparameter_search <- stm::searchK(
documents = twitter_stm$documents,
vocab = twitter_stm$vocab,
data = twitter_stm$meta,
K = c(3:10),
prevalence = prevalence_formula,
heldout.seed = 1, # seed for reproducibility
max.em.its = 5, # number of iterations
init.type = "Spectral",
verbose = FALSE)
# Fix optimal number of topics
(n_topics <- as.numeric(hyperparameter_search$results[
which.max(hyperparameter_search$results[, "heldout"]), "K"]))
## [1] 3
Seems 3 topics is a good choice, so let’s fit the actual model now:
# Fit STM
topic_model <- stm::stm(
documents = twitter_stm$documents,
vocab = twitter_stm$vocab,
data = twitter_stm$meta,
K = n_topics,
prevalence = prevalence_formula,
gamma.prior = "L1",
seed = 1,
max.em.its = 15,
init.type = "Spectral",
verbose = FALSE)
The tricky thing about unsupervised topic modeling is the interpretation part. Statistical computations yield the results that best fit the data according to model assumptions but whether the found topics are considered meaningful is up to the human analyst.
Let’s see whether we can make any sense of the model output by
For one, we can call stm
’s labelTopics
function which returns the top terms per topic according to different metrics (for details, best check the function documentation):
The n
argument allows to specify how many words should be displayed. We go for 10:
# Inspect results
(result_topic_modeling <- stm::labelTopics(topic_model, n = 10))
## Topic 1 Top Words:
## Highest Prob: afd, deutschland, cdu, bundesregier, bundestag, frau, polit, zeit, recht, land
## FREX: bundesregier, cdu, merkel, link, frau, csu, nicht, bundestag, fall, groko
## Lift: panik, app, putin, dafu, interess, taet, lockdown, bundesregier, link, schad
## Score: panik, afd, cdu, bundesregier, deutschland, bundestag, frau, polit, zeit, merkel
## Topic 2 Top Words:
## Highest Prob: jahr, corona, europa, klimaschutz, gruen, regier, heut, menschenrecht, euro, kris
## FREX: gruen, probl, kris, europa, thema, mrd, corona, mindestlohn, pandemi, scheu
## Lift: mindestlohn, scheu, macron, breg, noafd, mrd, verkehrsw, probl, nazi, thema
## Score: mindestlohn, corona, jahr, europa, gruen, kris, probl, grenz, mrd, debatt
## Topic 3 Top Words:
## Highest Prob: mensch, demokrati, dank, polizei, gewalt, tag, seehof, partei, zukunft, welt
## FREX: gewalt, tag, dank, einsatz, seit, glueckwunsch, antifa, rechtsstaat, staat, partei
## Lift: doppelmoral, respekt, einsatz, antifa, armut, gewalt, hall, jud, herzlich, lini
## Score: doppelmoral, mensch, dank, gewalt, demokrati, tag, staat, polizei, rechtsstaat, partei
We can also access the word lists directly if we are particularly interested in results w.r.t a specific metric:
t(result_topic_modeling$frex)
## [,1] [,2] [,3]
## [1,] "bundesregier" "gruen" "gewalt"
## [2,] "cdu" "probl" "tag"
## [3,] "merkel" "kris" "dank"
## [4,] "link" "europa" "einsatz"
## [5,] "frau" "thema" "seit"
## [6,] "csu" "mrd" "glueckwunsch"
## [7,] "nicht" "corona" "antifa"
## [8,] "bundestag" "mindestlohn" "rechtsstaat"
## [9,] "fall" "pandemi" "staat"
## [10,] "groko" "scheu" "partei"
NB: Our results here are not very glorious. We have already seen the extreme sparsity of the dfm
, and with documents that are pretty dissimilar a topic model can only do so much. This is also a consequence of our training data design: we have labels for relatively few, extremely short documents from a bunch of different authors. In reality, scraping gives us access to much larger amounts of data.
Next, let’s take another look at our topics by a more visually appealing means:
plot(topic_model, type = "summary")
We can also plot word clouds displaying the most prominent words per topic (size indicating the probability of a word occurring in a document of the given topic):
invisible(lapply(seq_len(n_topics), function(i) stm::cloud(topic_model, topic = i)))
Finally, let’s see the most representative documents for each topic, i.e., the tweets with the highest topic probability:
lapply(seq_len(n_topics), function(i) {
stm::findThoughts(
model = topic_model,
texts = twitter_stm$meta$text,
topics = i,
n = 2)$docs[[1]]})
## [[1]]
## [1] "Merkel wird auf europaeischer Ebene nicht mehr ernst genommen. Doch wie sollen Deutschlands Interessen weiterhin vertreten werden? Ihr Endspiel laeuft bereits... EU Bruessel AfD Af Dim Bundestag"
## [2] "Dieser Vorfall zeigt: DITIB kann kein Kooperationspartner sein. Wer Deradikalisierung und Extremismus-Praevention mit Radikalen und Extremisten betreibt, wird das Gegenteil erreichen. Islamismus darf nicht zu Deutschland gehoeren! Af Dim Bundestag AfD"
##
## [[2]]
## [1] "Ein Islamist im Herzen d frz Sicherheitsbehoerden im Kampf gegen Islamisten? frz Regierung deckt ihn? u dann mordet er 4 Polizisten? Und jetzt will Macron das vertuschen? An Euren Haenden klebt Blut! Ihr mordet mit! Unmoeglich in ? Ich fuerchte: nein."
## [2] "Wirklich eine sehr gute Wahl! Letztes Jahr hatte ich Kerry in Berlin getroffen, wo er in seinen Ausfuehrungen eine beeindruckende Bruecke von Klimaschutz zur internationalen Diplomatie geschlagen hat. Gut, dass die Regierung wieder zum Partner bei diesem so wichtigen Thema wird"
##
## [[3]]
## [1] "Der Staat hat nicht versagt, im Gegenteil. Er konnte den Anschlag verhindern, die Verbrecher fassen. Der Polizei und den Ermittlern gebuehrt groesster Dank, keine Beschimpfung. Es geht in erster Linie weder um Tuerken, Mazedonier oder Offenbacher. Nur um hochgefaehrliche Verbrecher"
## [2] "Es ist einfach nur widerlich, wie sich rechte Hetzer ueber den Tod von Walter Luebcke freuen. KommunalpolitikerInnen und Ehrenamtliche haben unseren Respekt verdient. Unsere Demokratie lebt von ihrem tagtaeglichen Engagement vor Ort."
Using this visual support (and probably playing around with, e.g., the number of features to include in the dfm
and the number of topics) we would now try to give our topic some decent names.
So stm
has thrown a lot of words at us from which we are hopefully able to infer meaningful topics, but which documents actually belong to which topic? The model output provides us with a vector of topic probabilities for each document, meaning we can label the tweets by picking the topic for which they have the highest respective probability:
# Extract topic probabilities together with topic IDs and discard docnum
topic_probs <- stm::make.dt(topic_model)[
, doc_id := names(twitter_stm$documents)
][, docnum := NULL]
topic_probs[sample(topic_probs[, .I], 5)]
## Topic1 Topic2 Topic3 doc_id
## 1: 0.3663729 0.3722390 0.2613881 DirkSpaniel15607438801
## 2: 0.2725094 0.2210692 0.5064214 groehe16033014001
## 3: 0.5790691 0.1983266 0.2226043 ulschzi15252663601
## 4: 0.2718234 0.3276447 0.4005319 Schwarz_MdB15596403001
## 5: 0.4698024 0.3066046 0.2235930 HajdukBundestag15518854801
# Get topic with highest score per document
topic_cols <- names(topic_probs)[startsWith(names(topic_probs), "Topic")]
topic_probs[
, `:=` (
max_topic_score = max(.SD, na.rm = TRUE),
topic_label = which.max(.SD)),
.SDcols = topic_cols,
by = doc_id]
topic_probs[sample(topic_probs[, .I], 5)]
## Topic1 Topic2 Topic3 doc_id max_topic_score
## 1: 0.3772191 0.2369846 0.3857962 n_roettgen16067506801 0.3857962
## 2: 0.2597242 0.2030802 0.5371956 akbulutgokay15821783401 0.5371956
## 3: 0.2418535 0.2512492 0.5068973 HeikoMaas15509916001 0.5068973
## 4: 0.3190105 0.4209160 0.2600735 HuberMdB15628304401 0.4209160
## 5: 0.4162258 0.3092677 0.2745065 Beatrix_vStorch15790920601 0.4162258
## topic_label
## 1: 3
## 2: 3
## 3: 3
## 4: 2
## 5: 1
For mapping the topic labels back to the original tweets append them to the docvars…
# Extract docvars
twitter_docvars <- data.table::as.data.table(
cbind(
doc_id = quanteda::docid(twitter_corpus),
quanteda::docvars(twitter_corpus)),
key = "doc_id")
# Append topic labels and remove irrelevant columns
twitter_docvars <- topic_probs[
twitter_docvars, on = "doc_id"
][, c(topic_cols) := NULL]
twitter_docvars[sample(twitter_docvars[, .I], 5)]
## doc_id max_topic_score topic_label last_name
## 1: KonstantinNotz15645111001 0.4087667 1 Notz
## 2: SBarrientosK15490968601 0.4485621 1 Barrientos
## 3: mueller_sepp15886573801 0.3639534 2 Müller
## 4: HeikoMaas15706258801 0.4146403 3 Maas
## 5: KerstinGriese15436363201 0.4631303 3 Griese
## first_name wahlkreis_name party
## 1: Dr. Konstantin von Herzogtum Lauenburg – Stormarn-Süd gruene
## 2: Simone Würzburg linke
## 3: Sepp Dessau – Wittenberg cdu_csu
## 4: Heiko Saarlouis spd
## 5: Kerstin Mettmann II spd
## bundesland unemployment_rate share_pop_migration username
## 1: Schleswig-Holstein 4.8 12.7 KonstantinNotz
## 2: Bayern 3.0 16.3 SBarrientosK
## 3: Sachsen-Anhalt 9.1 3.2 mueller_sepp
## 4: Saarland 5.5 16.1 HeikoMaas
## 5: Nordrhein-Westfalen 6.4 24.4 KerstinGriese
## followers_count created_at favorite_count retweet_count label
## 1: 69550 2019-07-30 18:25:00 743 123 negative
## 2: 1551 2019-02-02 08:41:00 0 0 negative
## 3: 1387 2020-05-05 05:43:00 5 1 negative
## 4: 434412 2019-10-09 12:58:00 933 178 negative
## 5: 6705 2018-12-01 03:52:00 9 2 positive
## emojis hashtags tags
## 1: #Frankfurt,#AfD
## 2: #wegmit219a
## 3: #Corona,#Autogipfel @DEHOGA_BV,@dehoga_st
## 4: #YomKippur,#Halle
## 5: #Ratingen @larsklingbeil
## text
## 1: Die erschuetternde Tat von Frankfurt mit der Fluechtlingskrise 2015 zu verknuepfen, um Profit daraus zu schlagen, ist offenkundig abwegig und politisch einfach maximal widerlich - hat aber bei der AfD offensichtlich System.
## 2: Es ist ein Skandal, dass die Bundesregierung sich darauf einigt, Frauen weiterhin zu entmuendigen und gesundheitliche Aufklaerung zu verhindern. wegmit219a. DIE LINKE kaempft fuer einen legalen Zugang zu Abtreibungen und fuer das Informations- und Selbstbestimmungsrecht der Frauen.
## 3: 2/2 Dagegen stehen viele Familienunternehmer mit tausenden Beschaeftigten, im Hotel- und Gaststaettensektor, am Rande ihrer Existenz. Hier weiter gezielt zu helfen und zu oeffnen ist das Gebot der Stunde. Nicht mehr und nicht weniger. Corona Autogipfel
## 4: Dass am Versoehnungsfest Yom Kippur auf eine Synagoge geschossen wird, trifft uns ins Herz. Wir alle muessen gegen den Antisemitismus in unserem Land vorgehen. In diesen schweren Stunden sind meine Gedanken bei den Toten und Verletzten, ihren Angehoerigen und der Polizei. Halle
## 5: Herzlichen Dank fuer den tollen Abend in Ratingen bei Kerstin Griese trifft... Munter diskutiert: Digitalisierung, Zukunft d. Arbeit, Qualifizierung u. Chancen, schnelle Netze, Weiterbildung, Demokratie im Internet, SPD.
…and, lastly, feed this information back into the original corpus object to save it for further usage:
# Insert info back into corpus
twitter_corpus_with_topics <- twitter_corpus
quanteda::docvars(twitter_corpus_with_topics) <- as.data.frame(twitter_docvars)
# Save
saveRDS(
twitter_corpus_with_topics,
sprintf("%s/twitter_corpus_with_topics.RDS", path))
So that’s how to model topics with an stm
! Even though our training corpus is not easily clustered into topics, you hopefully got some idea how to handle this process code-wise. And then, we just do not always get lucky with such tasks in a sense that they can be expected to produce meaningful output all the time ¯|(ツ)/¯