This demo is about computing topic-specific word embeddings with GloVe.
We show how to compute word embeddings via GloVe for a corpus segmented into multiple topics, where the embedding part is of course applicable to a non-partitioned corpus as well.
The code is based on the text2vec
package, much of which is built in R6
. Covering object-oriented programming is beyond the scope of this course, but we will mention some basics during sentiment analysis with mlr3
.
# Load required packages
library(Matrix)
library(quanteda)
library(text2vec)
We start by reading the data we have created during the topic modeling process:
# path <- ... (individual file location)
twitter_corpus <- readRDS(sprintf("%s/twitter_corpus_with_topics.RDS", path))
Next, we subset the corpus by topic label:
# Recode NA labels
twitter_corpus$topic_label <- ifelse(
is.na(twitter_corpus$topic_label),
99,
twitter_corpus$topic_label)
# Subset corpus
twitter_corpus_subsets <- lapply(
unique(twitter_corpus$topic_label),
function(i) quanteda::corpus_subset(twitter_corpus, topic_label == i))
sapply(twitter_corpus_subsets, quanteda::ndoc)
## [1] 498 237 423 57
We will now define the embedding procedure and encapsulate it in a function we can then apply to each corpus subset. It is typically advisable to break down such routines into sub-functions (top-down programming). This keeps the code readable and is much easier to debug.
Our embedding function should have the following structure (regard this as pseudo-code):
make_glove_embeddings <- function(text, stopwords, glove_args) {
tokens <- tokenize(text, stopwords)
vocabulary <- make_vocab(tokens, glove_args)
glove <- make_glove_instance(vocabulary, glove_args)
word_vectors <- get_word_vectors(tokens, glove, glove_args)
doc_embeddings <- get_doc_embeddings(tokens, word_vectors)
doc_embeddings
}
Tokenization feels pretty been-there-done-that by now:
tokenize <- function(text, stopwords) {
tokens <- quanteda::tokens(
text,
what = "word",
remove_symbols = TRUE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_separators = TRUE,
split_hyphens = TRUE,
include_docvars = TRUE)
tokens <- quanteda::tokens_wordstem(tokens, language = "german")
tokens <- quanteda::tokens_remove(
quanteda::tokens_tolower(tokens),
pattern = stopwords)
tokens <- quanteda::tokens_select(tokens, min_nchar = 3)
tokens
}
stopwords <- readRDS(sprintf("%s/stopwords.RDS", path))
test_tokens <- tokenize(twitter_corpus, stopwords)
test_tokens[1:2]
## Tokens consisting of 2 documents and 20 docvars.
## ABaerbock15156546001 :
## [1] "gro" "sondierungspapi" "klima" "nochmal"
## [5] "genau" "angeschaut" "krass" "facto"
## [9] "sogar" "kyoto" "protokoll" "meilenstein"
## [ ... and 9 more ]
##
## ABaerbock15210084001 :
## [1] "weltweit" "sieht" "angela" "merkel" "groko"
## [6] "nicht" "klimataug" "sond" "vielmehr" "paris"
## [11] "abkomm" "unterlaeuft"
## [ ... and 5 more ]
The next step is to create a vocabulary. This is where text2vec
first comes into play:
make_vocab <- function(tokens, term_count_min) {
# Convert tokens to list
tokens <- as.list(tokens)
# Create iterator (kind of list from which vocabulary can be created)
itokens <- text2vec::itoken(tokens, progressbar = FALSE)
# Create vocabulary and retain only words of a certain frequency
vocab <- text2vec::create_vocabulary(itokens)
vocab <- text2vec::prune_vocabulary(vocab, term_count_min = term_count_min)
list(itokens = itokens, vocab = vocab)
}
test_vocab <- make_vocab(test_tokens, term_count_min = 2)
summary(test_vocab$vocab)
## term term_count doc_count
## Length:2913 Min. : 2.000 Min. : 1.000
## Class :character 1st Qu.: 2.000 1st Qu.: 2.000
## Mode :character Median : 3.000 Median : 3.000
## Mean : 6.332 Mean : 6.038
## 3rd Qu.: 6.000 3rd Qu.: 6.000
## Max. :442.000 Max. :358.000
Let’s now create an instance of the GloVe class (if you are not familiar with object-oriented programming, imagine this as some kind of abstract machine we can kick-start to compute our embeddings).
This commands the specification of some additional arguments:
skip_grams_window
, which we will later set rather small as our texts are super short,dimension
, arguably the most important hyperparameter, andx_max
, controlling the GloVe-inherent weighting function (not that critical for us as we have only very few words with multiple co-occurrences)make_glove_instance <- function(vocabulary, skip_grams_window, dimension, x_max) {
vect <- text2vec::vocab_vectorizer(vocabulary$vocab)
tcm <- text2vec::create_tcm(
vocabulary$itokens,
vect,
skip_grams_window = skip_grams_window)
glove_instance <- text2vec::GlobalVectors$new(rank = dimension, x_max = x_max)
list(tcm = tcm, glove_instance = glove_instance)
}
test_model <- make_glove_instance(
test_vocab,
skip_grams_window = 5,
dimension = 3,
x_max = 5)
With this somewhat hard-to-grasp container we can fit the GloVe (no bad jokes intended).
Again, there are some (rather technical) hyperparameters to set:
iterations
– how many iterations do we allow for the optimization process?convergence_tol
– how large is our tolerance in declaring convergence (the process will terminate if either convergence is reached or the number of iterations is exhausted)?Besides, we note that the word vectors are composed of the actual word vectors, here called wv_main
, and some “context” vectors wv_context
. The original paper is somewhat cryptic about the purpose of this double computation but it seems to help alleviate overfitting and noise (and to not make a difference in the worst case).
get_word_vectors <- function(tokens, glove, iterations, convergence_tol) {
# Compute word vectors
wv_main <- glove$glove_instance$fit_transform(
glove$tcm,
n_iter = iterations,
convergence_tol = convergence_tol)
# Compute context vectors
wv_context <- glove$glove_instance$components
wv_main + t(wv_context)
}
test_word_vectors <- get_word_vectors(
tokens = test_tokens,
glove = test_model,
iterations = 3,
convergence_tol = 0.001)
## INFO [17:18:21.112] epoch 1, loss 0.1226
## INFO [17:18:21.221] epoch 2, loss 0.0972
## INFO [17:18:21.258] epoch 3, loss 0.0907
test_word_vectors[sample(nrow(test_word_vectors), 5), ]
## [,1] [,2] [,3]
## unfassbar -0.3863378 0.06020361 -0.02467850
## gez 0.3217785 0.13338595 0.09811976
## rundfunkgebuehr 0.4990832 -0.03229976 0.15888837
## jahr -0.6718606 0.18209690 0.87011693
## immunitaet 0.1257666 -0.57761629 -0.44245804
This provides us with the embedding vectors per word, from which we calculate the embeddings per document by averaging the embeddings across all words present in the document (weighted by occurrence).
Mathematically speaking, we can achieve this by matrix multiplication:
\[\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 2 \end{bmatrix}\]
\[\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix}\]
\[ \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix} \times \begin{bmatrix} 0.34 & 0.11 & 0.52 \\ 0.25 & 0.08 & 0.64 \\ 0.02 & 0.77 & 0.15 \end{bmatrix} \\ = \begin{bmatrix} \frac{1}{2} \cdot 0.34 + \frac{1}{2} \cdot 0.25 + 0 \cdot 0.02 & \frac{1}{2} \cdot 0.11 + \frac{1}{2} \cdot 0.08 + 0 \cdot 0.77 & \frac{1}{2} \cdot 0.52 + \frac{1}{2} \cdot 0.64 + 0 \cdot 0.15 \\ 0 \cdot 0.34 + \frac{1}{3} \cdot 0.25 + \frac{2}{3} \cdot 0.02 & 0 \cdot 0.11 + \frac{1}{3} \cdot 0.08 + \frac{2}{3} \cdot 0.77 & 0 \cdot 0.52 + \frac{1}{3} \cdot 0.64 + \frac{2}{3} \cdot 0.15 \end{bmatrix} \]
\[\begin{bmatrix} 0.295 & 0.095 & 0.580 \\ 0.097 & 0.540 & 0.313 \end{bmatrix}\]
get_doc_embeddings <- function(tokens, word_vectors) {
# Compute document-term matrix
dtm <- quanteda::dfm_match(
quanteda::dfm(tokens),
rownames(word_vectors))
# Modify so rows sum to 1
dtm <- text2vec::normalize(dtm, norm = "l1")
# Perform matrix multiplication
doc_embeddings <- as.matrix(dtm) %*% word_vectors
# Append doc_id info
doc_embeddings <- data.table::as.data.table(doc_embeddings)
doc_embeddings[, doc_id := tokens$doc_id]
doc_embeddings
}
test_doc_embeddings <- get_doc_embeddings(test_tokens, test_word_vectors)
test_doc_embeddings[sample(nrow(test_doc_embeddings), 5), ]
## V1 V2 V3 doc_id
## 1: 0.001636491 -0.07232625 0.14960508 Oliver_Krischer15546483601
## 2: 0.074788185 -0.09397858 0.19964961 sven_kindler15810591001
## 3: -0.159996504 -0.04688173 0.28462818 danielakolbe15812392801
## 4: 0.009321571 -0.01631489 0.09157415 AndrejHunko15662928601
## 5: -0.214895043 0.02756950 0.20538623 tpflueger15647325601
We now collect our sub-routines in one large function we apply to each of our corpus subsets:
make_glove_embeddings <- function(corpus,
stopwords,
term_count_min,
skip_grams_window,
dimension,
x_max,
iterations,
convergence_tol) {
tokens <- tokenize(corpus, stopwords)
vocabulary <- make_vocab(tokens, term_count_min)
glove <- make_glove_instance(vocabulary, skip_grams_window, dimension, x_max)
word_vectors <- get_word_vectors(tokens, glove, iterations, convergence_tol)
doc_embeddings <- get_doc_embeddings(tokens, word_vectors)
doc_embeddings
}
embeddings <- lapply(
twitter_corpus_subsets,
function(i) {make_glove_embeddings(
i,
stopwords = stopwords,
term_count_min = 2,
skip_grams_window = 5,
dimension = 3,
x_max = 10,
iterations = 10,
convergence_tol = 0.001)})
And lastly, we store all embedding vectors in one data.table
we can use for sentiment analysis, looking like this:
\[\begin{bmatrix} \text{embeddings for topic 1} & 0 & 0 & 0\\ 0 & \text{embeddings for topic 2} & 0 & 0 \\ 0 & 0 & \text{embeddings for topic 3} & 0 \\ 0 & 0 & 0 & \text{embeddings for topic 4} \end{bmatrix}\]
# Save doc_id for join
doc_id <- unlist(lapply(embeddings, function(i) i$doc_id))
invisible(lapply(embeddings, function(i) i[, doc_id := NULL]))
# Create block matrix where each document's embedding loadings depend
# on its topic label, but save doc_id info before
embeddings <- lapply(embeddings, as.matrix)
embedding_matrix <- do.call(Matrix::bdiag, embeddings)
# Convert back to data.table
embeddings_dt <- data.table::as.data.table(as.matrix(embedding_matrix))
data.table::setnames(
embeddings_dt,
sprintf("embedding_%d", seq_along(embeddings_dt)))
embeddings_dt[, doc_id := ..doc_id]
# Inspect
embeddings_dt[sample(embeddings_dt[, .I], 10)]
## embedding_1 embedding_2 embedding_3 embedding_4 embedding_5 embedding_6
## 1: 0.000000 0.0000000 0.00000000 0.05482503 0.06856164 -0.01412917
## 2: 0.000000 0.0000000 0.00000000 -0.08276987 -0.05781258 0.04739306
## 3: 0.000000 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000
## 4: 0.000000 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000
## 5: -0.109036 0.2577165 0.05673243 0.00000000 0.00000000 0.00000000
## 6: 0.000000 0.0000000 0.00000000 -0.17533056 0.08600875 -0.20579785
## 7: 0.000000 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000
## 8: -0.371374 0.4012897 -0.04043765 0.00000000 0.00000000 0.00000000
## 9: 0.000000 0.0000000 0.00000000 -0.27099564 0.11642943 0.21945584
## 10: 0.000000 0.0000000 0.00000000 0.01614947 -0.01226427 -0.16079251
## embedding_7 embedding_8 embedding_9 embedding_10 embedding_11 embedding_12
## 1: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## 2: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## 3: 0.08121451 -0.1553581 -0.0612087 0.0000000 0.0000000 0.00000000
## 4: 0.00000000 0.0000000 0.0000000 0.0198986 -0.1740356 0.03903582
## 5: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## 6: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## 7: 0.05419094 0.1526954 -0.1800756 0.0000000 0.0000000 0.00000000
## 8: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## 9: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## 10: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000
## doc_id
## 1: hahnflo15968008801
## 2: BriHasselmann15670225801
## 3: katjakipping15542043601
## 4: karl_lauterbach15973945201
## 5: kleikert15421870801
## 6: IreneMihalic16021797601
## 7: JM_Luczak15713945401
## 8: NordMdb15880587601
## 9: JuergenBraunAfD15845533201
## 10: victorperli15914375401
# Insert info back into corpus
twitter_docvars <- data.table::as.data.table(
cbind(
doc_id = quanteda::docid(twitter_corpus),
quanteda::docvars(twitter_corpus)),
key = "doc_id")
twitter_docvars <- embeddings_dt[twitter_docvars, on = "doc_id"]
twitter_corpus_with_embeddings <- twitter_corpus
quanteda::docvars(twitter_corpus_with_embeddings) <- as.data.frame(twitter_docvars)
# Save
saveRDS(
twitter_corpus_with_embeddings,
sprintf("%s/twitter_corpus_with_embeddings.RDS", path))
And that’s it! Now we have topic-specific embedding vectors we can use as features in the upcoming sentiment analysis.