Modeling Topics with an STM

This demo is about topic modeling with a structural topic model (STM).

We show how to prepare the data for fitting an STM (Roberts et al., 2016), conduct the modeling process, and analyze the results.

# Load required packages

library(data.table)
library(quanteda)
library(stm)

Create dfm object

As in the static feature extraction demo, we convert our data to a dfm. However, we focus on slightly different tokens this time as we are now interested in identifying topical relations.

First, we get the data (the path variable again requiring individual adaptation):

# path <- ... (individual file location)

twitter_corpus <- readRDS(sprintf("%s/twitter_corpus.RDS", path))

twitter_corpus
## Corpus consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
## "Habe mir das Gro Ko-Sondierungspapier zu Klima nochmal genau..."
## 
## ABaerbock15210084001 :
## "Auch weltweit sieht man, dass Angela Merkel s GroKo 3.0 nich..."
## 
## ABaerbock15216341401 :
## "Der groessten globalen Herausforderung, der Klimakrise, 30 S..."
## 
## ABaerbock15252349801 :
## "Wir brauchen eine andere Verkehrspolitik - weg von Benzinern..."
## 
## ABaerbock15256297201 :
## "Das ist die Leistung von Hunderten Wahlkaempfern und Politik..."
## 
## ABaerbock15283779001 :
## "Luise Amtsberg hat heute alles gesagt,was es zum Familiennac..."
## 
## [ reached max_ndoc ... 1,209 more documents ]

We furthermore store the document texts in the docvars (which poses a redundancy in the corpus object but ensures the texts are not lost during conversion for the dfm).

quanteda::docvars(twitter_corpus)$text <- quanteda::texts(twitter_corpus)

Now let’s tokenize our corpus in a way appropriate for topic modeling.

We assume that topical information is mostly carried by nouns and choose to remove all other words for document representation, attempting to eliminate noise that might distract our topic model. This is, of course, a simplification, but that’s life (of an analyst).

Luckily, nouns are pretty easy to identify by their leading capital letters in German (implicitly relying on MPs’ grammar skills and/or auto-correct, and accepting that this will also include non-nouns from the beginning of a sentence):

# First, tokenize as usual

twitter_tokens <- quanteda::tokens(
  twitter_corpus,
  what = "word",
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_separators = TRUE,
  split_hyphens = TRUE,
  include_docvars = TRUE)
# Stem

twitter_tokens <- quanteda::tokens_wordstem(
  twitter_tokens, 
  language = "german")
# Keep only tokens starting with a capital letter

twitter_tokens <- quanteda::tokens_keep(
  twitter_tokens,
  pattern = c("[:upper:]([:lower:])+"),
  valuetype = "regex",
  case_insensitive = FALSE)

# Lowercase afterwards

twitter_tokens <- quanteda::tokens_tolower(twitter_tokens)

For stopwords removal we retrace the steps taken in the static feature extraction:

# Remove stopwords using our cleaning function from earlier

stopwords <- clean_and_stem(quanteda::stopwords(language = "de"))
stopwords <- stopwords[nchar(stopwords) > 0]
stopwords <- stringr::str_remove_all(
  stopwords,
  "kein(.)*|nicht")
stopwords <- c(
  stopwords, 
  c("der", "die", "das", "was", "wer", "wie", "ich", "sie", "wir", "ihr"))

saveRDS(stopwords, sprintf("%s/stopwords.RDS", path))

twitter_tokens <- quanteda::tokens_remove(
  twitter_tokens,
  pattern = stopwords)

twitter_tokens <- quanteda::tokens_select(twitter_tokens, min_nchar = 3)

# Inspect

twitter_tokens
## Tokens consisting of 1,215 documents and 17 docvars.
## ABaerbock15156546001 :
##  [1] "gro"             "sondierungspapi" "klima"           "krass"          
##  [5] "kyoto"           "protokoll"       "meilenstein"     "klimadiplomati" 
##  [9] "frau"            "merkel"          "nachbess"       
## 
## ABaerbock15210084001 :
## [1] "angela"              "merkel"              "groko"              
## [4] "paris"               "abkomm"              "dah"                
## [7] "klimasofortprogramm"
## 
## ABaerbock15216341401 :
## [1] "herausforder"      "klimakris"         "sek"              
## [4] "std"               "regierungserklaer" "leerstell"        
## [7] "koalitionsvertrag" "fuess"             "klimafolg"        
## 
## ABaerbock15252349801 :
## [1] "verkehrspolit" "benzin"        "diesel"        "mobilitaet"   
## [5] "frag"          "luft"          "wohnung"       "strass"       
## 
## ABaerbock15256297201 :
## [1] "leistung"     "hundert"      "wahlkaempf"   "polit"        "land"        
## [6] "mensch"       "konkret"      "polit"        "kommunalwahl"
## 
## ABaerbock15283779001 :
##  [1] "luis"            "amtsberg"        "familiennachzug" "integration"    
##  [5] "gefluechtet"     "perspektiv"      "zusammenleb"     "famili"         
##  [9] "grundrecht"      "famili"          "groko"          
## 
## [ reached max_ndoc ... 1,209 more documents ]

Looking good, so convert to dfm:

twitter_dfm <- quanteda::dfm(twitter_tokens)

twitter_dfm
## Document-feature matrix of: 1,215 documents, 4,923 features (99.8% sparse) and 17 docvars.
##                       features
## docs                   gro sondierungspapi klima krass kyoto protokoll
##   ABaerbock15156546001   1               1     1     1     1         1
##   ABaerbock15210084001   0               0     0     0     0         0
##   ABaerbock15216341401   0               0     0     0     0         0
##   ABaerbock15252349801   0               0     0     0     0         0
##   ABaerbock15256297201   0               0     0     0     0         0
##   ABaerbock15283779001   0               0     0     0     0         0
##                       features
## docs                   meilenstein klimadiplomati frau merkel
##   ABaerbock15156546001           1              1    1      1
##   ABaerbock15210084001           0              0    0      1
##   ABaerbock15216341401           0              0    0      0
##   ABaerbock15252349801           0              0    0      0
##   ABaerbock15256297201           0              0    0      0
##   ABaerbock15283779001           0              0    0      0
## [ reached max_ndoc ... 1,209 more documents, reached max_nfeat ... 4,913 more features ]

Unfortunately, as the above output tells us, our dfm is extremely sparse – tokens are not at all common across documents, which will pose a challenge to any topic model. This is actually not too surprising when we recall the extreme brevity of our Twitter documents.

Side note: if you run into this problem you might want to consider the option of pooling texts into larger documents (this can be achieved, for instance, calling quanteda::dfm_group). However, this is obviously a heavy simplification and leads to all documents that share the same grouping variable being assigned the same topic label.

What we will do to alleviate the sparsity problem at least to some extent is to condense the dfm to the most prominent features:

# Eliminate less common features

twitter_dfm <- quanteda::dfm_select(
  twitter_dfm,
  names(quanteda::topfeatures(twitter_dfm, n = 300)))

twitter_dfm
## Document-feature matrix of: 1,215 documents, 300 features (98.8% sparse) and 17 docvars.
##                       features
## docs                   klima frau merkel groko klimakris diesel frag luft
##   ABaerbock15156546001     1    1      1     0         0      0    0    0
##   ABaerbock15210084001     0    0      1     1         0      0    0    0
##   ABaerbock15216341401     0    0      0     0         1      0    0    0
##   ABaerbock15252349801     0    0      0     0         0      1    1    1
##   ABaerbock15256297201     0    0      0     0         0      0    0    0
##   ABaerbock15283779001     0    0      0     1         0      0    0    0
##                       features
## docs                   strass polit
##   ABaerbock15156546001      0     0
##   ABaerbock15210084001      0     0
##   ABaerbock15216341401      0     0
##   ABaerbock15252349801      1     0
##   ABaerbock15256297201      0     2
##   ABaerbock15283779001      0     0
## [ reached max_ndoc ... 1,209 more documents, reached max_nfeat ... 290 more features ]

Create prevalence formula

The special trait of the STM is its making use of s document-level meta data, so we need to specify which information to use here. In the stm function the corresponding argument is called prevalence and must be stated as a formula.

We will use the following meta variables (smooth effects defaulting to B-splines):

  • party as a categorical variable,
  • bundesland as a categorical variable,
  • unemployment_rate as a smooth effect with 5 degrees of freedom, and
  • share_pop_migration as a smooth effect with 5 degrees of freedom.

The resulting formula, ~ party + bundesland + s(unemployment_rate, df = 5) + s(share_pop_migration, df = 5), is perfectly disputable – this decision is up to domain knowledge and some trial-and-error.

prevalence_formula <- as.formula(paste(
  "", 
  "party + bundesland + s(unemployment_rate, df = 5) + s(share_pop_migration, df = 5)", 
  sep = "~"))

Now we have one thing to pay attention to: the stm implementation cannot handle missing data in its prevalence variables. We must therefore exclude non-complete observations (alternatively, we could impute the missing values if it made sense and we had a meaningful imputation mechanism):

twitter_dfm <- quanteda::dfm_subset(
  twitter_dfm,
  !is.na(party) & !is.na(bundesland) & !is.na(unemployment_rate) &
    !is.na(share_pop_migration))

twitter_dfm
## Document-feature matrix of: 1,199 documents, 300 features (98.8% sparse) and 17 docvars.
##                       features
## docs                   klima frau merkel groko klimakris diesel frag luft
##   ABaerbock15156546001     1    1      1     0         0      0    0    0
##   ABaerbock15210084001     0    0      1     1         0      0    0    0
##   ABaerbock15216341401     0    0      0     0         1      0    0    0
##   ABaerbock15252349801     0    0      0     0         0      1    1    1
##   ABaerbock15256297201     0    0      0     0         0      0    0    0
##   ABaerbock15283779001     0    0      0     1         0      0    0    0
##                       features
## docs                   strass polit
##   ABaerbock15156546001      0     0
##   ABaerbock15210084001      0     0
##   ABaerbock15216341401      0     0
##   ABaerbock15252349801      1     0
##   ABaerbock15256297201      0     2
##   ABaerbock15283779001      0     0
## [ reached max_ndoc ... 1,193 more documents, reached max_nfeat ... 290 more features ]

Create stm object

We then convert our dfm to an stm object we can use in the structural topic model. This is super simple with quanteda:

twitter_stm <- quanteda::convert(twitter_dfm, to = "stm")

summary(twitter_stm)
##           Length Class      Mode     
## documents 1158   -none-     list     
## vocab      300   -none-     character
## meta        17   data.frame list

Fit STM

Time to fit the actual model.

While models learn all necessary parameters during training, hyperparameters need to be specified upfront. The most critical hyperparameter here is the number of topics, K. So how to find K?

Topic modeling being an unsupervised task (i.e., we do not have access to the ground truth), the quality of model output must ultimately be judged by human interpretability. We can seek some help from the data: stm offers a function that searches for the optimal K (still requiring a range of potential values!). Exactly what optimal means can be somewhat steered by the user. We will use the held-out likelihood option which creates an artificial supervised task: we mask a certain part of some of the documents and ask the model to complete them (note that the ground truth is known now). The more predictive power our model has, the higher the likelihood for the held-out words should be (for details see Wallach et al., 2009:

# Conduct hyperparameter search

hyperparameter_search <- stm::searchK(
  documents = twitter_stm$documents,
  vocab = twitter_stm$vocab,
  data = twitter_stm$meta,
  K = c(3:10),
  prevalence = prevalence_formula,
  heldout.seed = 1, # seed for reproducibility
  max.em.its = 5, # number of iterations
  init.type = "Spectral",
  verbose = FALSE)
# Fix optimal number of topics

(n_topics <- as.numeric(hyperparameter_search$results[
  which.max(hyperparameter_search$results[, "heldout"]), "K"]))
## [1] 3

Seems 3 topics is a good choice, so let’s fit the actual model now:

# Fit STM

topic_model <- stm::stm(
  documents = twitter_stm$documents,
  vocab = twitter_stm$vocab,
  data = twitter_stm$meta,
  K = n_topics,
  prevalence = prevalence_formula,
  gamma.prior = "L1",
  seed = 1,
  max.em.its = 15,
  init.type = "Spectral",
  verbose = FALSE)

Interpret results

The tricky thing about unsupervised topic modeling is the interpretation part. Statistical computations yield the results that best fit the data according to model assumptions but whether the found topics are considered meaningful is up to the human analyst.

Let’s see whether we can make any sense of the model output by

  • inspecting the most characteristic words for each topic,
  • printing the most representative document for a given topic, and
  • generating a variety of visualizations.

For one, we can call stm’s labelTopics function which returns the top terms per topic according to different metrics (for details, best check the function documentation):

  • Highest probability: which words are most likely to occur in a document of the given topic?
  • FREX: which words occur both frequently and exclusively, i.e., serve to characterize and distinguish topics?
  • Lift: which words occur most frequently if weighted by the inverse of their relative empirical frequency across the entire corpus?
  • Score: which words show the largest difference between their (log) frequency in topic k and their average (log) frequency across all topics?

The n argument allows to specify how many words should be displayed. We go for 10:

# Inspect results

(result_topic_modeling <- stm::labelTopics(topic_model, n = 10))
## Topic 1 Top Words:
##       Highest Prob: afd, deutschland, cdu, bundesregier, bundestag, frau, polit, zeit, recht, land 
##       FREX: bundesregier, cdu, merkel, link, frau, csu, nicht, bundestag, fall, groko 
##       Lift: panik, app, putin, dafu, interess, taet, lockdown, bundesregier, link, schad 
##       Score: panik, afd, cdu, bundesregier, deutschland, bundestag, frau, polit, zeit, merkel 
## Topic 2 Top Words:
##       Highest Prob: jahr, corona, europa, klimaschutz, gruen, regier, heut, menschenrecht, euro, kris 
##       FREX: gruen, probl, kris, europa, thema, mrd, corona, mindestlohn, pandemi, scheu 
##       Lift: mindestlohn, scheu, macron, breg, noafd, mrd, verkehrsw, probl, nazi, thema 
##       Score: mindestlohn, corona, jahr, europa, gruen, kris, probl, grenz, mrd, debatt 
## Topic 3 Top Words:
##       Highest Prob: mensch, demokrati, dank, polizei, gewalt, tag, seehof, partei, zukunft, welt 
##       FREX: gewalt, tag, dank, einsatz, seit, glueckwunsch, antifa, rechtsstaat, staat, partei 
##       Lift: doppelmoral, respekt, einsatz, antifa, armut, gewalt, hall, jud, herzlich, lini 
##       Score: doppelmoral, mensch, dank, gewalt, demokrati, tag, staat, polizei, rechtsstaat, partei

We can also access the word lists directly if we are particularly interested in results w.r.t a specific metric:

t(result_topic_modeling$frex)
##       [,1]           [,2]          [,3]          
##  [1,] "bundesregier" "gruen"       "gewalt"      
##  [2,] "cdu"          "probl"       "tag"         
##  [3,] "merkel"       "kris"        "dank"        
##  [4,] "link"         "europa"      "einsatz"     
##  [5,] "frau"         "thema"       "seit"        
##  [6,] "csu"          "mrd"         "glueckwunsch"
##  [7,] "nicht"        "corona"      "antifa"      
##  [8,] "bundestag"    "mindestlohn" "rechtsstaat" 
##  [9,] "fall"         "pandemi"     "staat"       
## [10,] "groko"        "scheu"       "partei"

NB: Our results here are not very glorious. We have already seen the extreme sparsity of the dfm, and with documents that are pretty dissimilar a topic model can only do so much. This is also a consequence of our training data design: we have labels for relatively few, extremely short documents from a bunch of different authors. In reality, scraping gives us access to much larger amounts of data.

Next, let’s take another look at our topics by a more visually appealing means:

plot(topic_model, type = "summary")

We can also plot word clouds displaying the most prominent words per topic (size indicating the probability of a word occurring in a document of the given topic):

invisible(lapply(seq_len(n_topics), function(i) stm::cloud(topic_model, topic = i)))

Finally, let’s see the most representative documents for each topic, i.e., the tweets with the highest topic probability:

lapply(seq_len(n_topics), function(i) {
  stm::findThoughts(
    model = topic_model,
    texts = twitter_stm$meta$text,
    topics = i,
    n = 2)$docs[[1]]})
## [[1]]
## [1] "Merkel wird auf europaeischer Ebene nicht mehr ernst genommen. Doch wie sollen Deutschlands Interessen weiterhin vertreten werden? Ihr Endspiel laeuft bereits... EU Bruessel AfD Af Dim Bundestag"                                                         
## [2] "Dieser Vorfall zeigt: DITIB kann kein Kooperationspartner sein. Wer Deradikalisierung und Extremismus-Praevention mit Radikalen und Extremisten betreibt, wird das Gegenteil erreichen. Islamismus darf nicht zu Deutschland gehoeren! Af Dim Bundestag AfD"
## 
## [[2]]
## [1] "Ein Islamist im Herzen d frz Sicherheitsbehoerden im Kampf gegen Islamisten? frz Regierung deckt ihn? u dann mordet er 4 Polizisten? Und jetzt will Macron das vertuschen? An Euren Haenden klebt Blut! Ihr mordet mit! Unmoeglich in ? Ich fuerchte: nein."                           
## [2] "Wirklich eine sehr gute Wahl! Letztes Jahr hatte ich Kerry in Berlin getroffen, wo er in seinen Ausfuehrungen eine beeindruckende Bruecke von Klimaschutz zur internationalen Diplomatie geschlagen hat. Gut, dass die Regierung wieder zum Partner bei diesem so wichtigen Thema wird"
## 
## [[3]]
## [1] "Der Staat hat nicht versagt, im Gegenteil. Er konnte den Anschlag verhindern, die Verbrecher fassen. Der Polizei und den Ermittlern gebuehrt groesster Dank, keine Beschimpfung. Es geht in erster Linie weder um Tuerken, Mazedonier oder Offenbacher. Nur um hochgefaehrliche Verbrecher"
## [2] "Es ist einfach nur widerlich, wie sich rechte Hetzer ueber den Tod von Walter Luebcke freuen. KommunalpolitikerInnen und Ehrenamtliche haben unseren Respekt verdient. Unsere Demokratie lebt von ihrem tagtaeglichen Engagement vor Ort."

Using this visual support (and probably playing around with, e.g., the number of features to include in the dfm and the number of topics) we would now try to give our topic some decent names.

Assign documents to topics

So stm has thrown a lot of words at us from which we are hopefully able to infer meaningful topics, but which documents actually belong to which topic? The model output provides us with a vector of topic probabilities for each document, meaning we can label the tweets by picking the topic for which they have the highest respective probability:

# Extract topic probabilities together with topic IDs and discard docnum

topic_probs <- stm::make.dt(topic_model)[
  , doc_id := names(twitter_stm$documents)
  ][, docnum := NULL]

topic_probs[sample(topic_probs[, .I], 5)]
##       Topic1    Topic2    Topic3                     doc_id
## 1: 0.3663729 0.3722390 0.2613881     DirkSpaniel15607438801
## 2: 0.2725094 0.2210692 0.5064214          groehe16033014001
## 3: 0.5790691 0.1983266 0.2226043         ulschzi15252663601
## 4: 0.2718234 0.3276447 0.4005319     Schwarz_MdB15596403001
## 5: 0.4698024 0.3066046 0.2235930 HajdukBundestag15518854801
# Get topic with highest score per document

topic_cols <- names(topic_probs)[startsWith(names(topic_probs), "Topic")]

topic_probs[
  , `:=` (
    max_topic_score = max(.SD, na.rm = TRUE),
    topic_label = which.max(.SD)),
  .SDcols = topic_cols,
  by = doc_id]

topic_probs[sample(topic_probs[, .I], 5)]
##       Topic1    Topic2    Topic3                     doc_id max_topic_score
## 1: 0.3772191 0.2369846 0.3857962      n_roettgen16067506801       0.3857962
## 2: 0.2597242 0.2030802 0.5371956    akbulutgokay15821783401       0.5371956
## 3: 0.2418535 0.2512492 0.5068973       HeikoMaas15509916001       0.5068973
## 4: 0.3190105 0.4209160 0.2600735        HuberMdB15628304401       0.4209160
## 5: 0.4162258 0.3092677 0.2745065 Beatrix_vStorch15790920601       0.4162258
##    topic_label
## 1:           3
## 2:           3
## 3:           3
## 4:           2
## 5:           1

For mapping the topic labels back to the original tweets append them to the docvars…

# Extract docvars

twitter_docvars <- data.table::as.data.table(
  cbind(
    doc_id = quanteda::docid(twitter_corpus),
    quanteda::docvars(twitter_corpus)), 
  key = "doc_id")

# Append topic labels and remove irrelevant columns

twitter_docvars <- topic_probs[
  twitter_docvars, on = "doc_id"
    ][, c(topic_cols) := NULL]

twitter_docvars[sample(twitter_docvars[, .I], 5)]
##                       doc_id max_topic_score topic_label  last_name
## 1: KonstantinNotz15645111001       0.4087667           1       Notz
## 2:   SBarrientosK15490968601       0.4485621           1 Barrientos
## 3:   mueller_sepp15886573801       0.3639534           2     Müller
## 4:      HeikoMaas15706258801       0.4146403           3       Maas
## 5:  KerstinGriese15436363201       0.4631303           3     Griese
##            first_name                     wahlkreis_name   party
## 1: Dr. Konstantin von Herzogtum Lauenburg – Stormarn-Süd  gruene
## 2:             Simone                           Würzburg   linke
## 3:               Sepp                Dessau – Wittenberg cdu_csu
## 4:              Heiko                          Saarlouis     spd
## 5:            Kerstin                        Mettmann II     spd
##             bundesland unemployment_rate share_pop_migration       username
## 1:  Schleswig-Holstein               4.8                12.7 KonstantinNotz
## 2:              Bayern               3.0                16.3   SBarrientosK
## 3:      Sachsen-Anhalt               9.1                 3.2   mueller_sepp
## 4:            Saarland               5.5                16.1      HeikoMaas
## 5: Nordrhein-Westfalen               6.4                24.4  KerstinGriese
##    followers_count          created_at favorite_count retweet_count    label
## 1:           69550 2019-07-30 18:25:00            743           123 negative
## 2:            1551 2019-02-02 08:41:00              0             0 negative
## 3:            1387 2020-05-05 05:43:00              5             1 negative
## 4:          434412 2019-10-09 12:58:00            933           178 negative
## 5:            6705 2018-12-01 03:52:00              9             2 positive
##    emojis            hashtags                  tags
## 1:            #Frankfurt,#AfD                      
## 2:                #wegmit219a                      
## 3:        #Corona,#Autogipfel @DEHOGA_BV,@dehoga_st
## 4:          #YomKippur,#Halle                      
## 5:                  #Ratingen        @larsklingbeil
##                                                                                                                                                                                                                                                                                          text
## 1:                                                            Die erschuetternde Tat von Frankfurt mit der Fluechtlingskrise 2015 zu verknuepfen, um Profit daraus zu schlagen, ist offenkundig abwegig und politisch einfach maximal widerlich - hat aber bei der AfD offensichtlich System.
## 2: Es ist ein Skandal, dass die Bundesregierung sich darauf einigt, Frauen weiterhin zu entmuendigen und gesundheitliche Aufklaerung zu verhindern. wegmit219a. DIE LINKE kaempft fuer einen legalen Zugang zu Abtreibungen und fuer das Informations- und Selbstbestimmungsrecht der Frauen.
## 3:                               2/2 Dagegen stehen viele Familienunternehmer mit tausenden Beschaeftigten, im Hotel- und Gaststaettensektor, am Rande ihrer Existenz. Hier weiter gezielt zu helfen und zu oeffnen ist das Gebot der Stunde. Nicht mehr und nicht weniger. Corona Autogipfel
## 4:     Dass am Versoehnungsfest Yom Kippur auf eine Synagoge geschossen wird, trifft uns ins Herz. Wir alle muessen gegen den Antisemitismus in unserem Land vorgehen. In diesen schweren Stunden sind meine Gedanken bei den Toten und Verletzten, ihren Angehoerigen und der Polizei. Halle
## 5:                                                               Herzlichen Dank fuer den tollen Abend in Ratingen bei Kerstin Griese trifft... Munter diskutiert: Digitalisierung, Zukunft d. Arbeit, Qualifizierung u. Chancen, schnelle Netze, Weiterbildung, Demokratie im Internet, SPD.

…and, lastly, feed this information back into the original corpus object to save it for further usage:

# Insert info back into corpus

twitter_corpus_with_topics <- twitter_corpus

quanteda::docvars(twitter_corpus_with_topics) <- as.data.frame(twitter_docvars)

# Save

saveRDS(
  twitter_corpus_with_topics, 
  sprintf("%s/twitter_corpus_with_topics.RDS", path))

So that’s how to model topics with an stm! Even though our training corpus is not easily clustered into topics, you hopefully got some idea how to handle this process code-wise. And then, we just do not always get lucky with such tasks in a sense that they can be expected to produce meaningful output all the time ¯|(ツ)