Compute Topic-Specific Word Embeddings with Global Vectors (GloVe)

This demo is about computing topic-specific word embeddings with GloVe.

We show how to compute word embeddings via GloVe for a corpus segmented into multiple topics; the embedding procedure itself is, of course, just as applicable to a non-partitioned corpus.

The code is based on the text2vec package, much of which is built on R6 classes. Covering object-oriented programming in depth is beyond the scope of this course, but we will touch upon some basics during sentiment analysis with mlr3.
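
For orientation, here is a minimal R6 sketch with a hypothetical Counter class (it is not part of text2vec, just an illustration of the pattern): objects are created with $new() and carry their own fields and methods, which we access with $ – exactly how we will use GlobalVectors below.

# Minimal R6 illustration (hypothetical Counter class, not part of text2vec)
library(R6)

Counter <- R6::R6Class("Counter",
  public = list(
    value = 0,
    add = function(x) {
      self$value <- self$value + x
      invisible(self)  # return the object itself to allow chaining
    }
  )
)

counter <- Counter$new()  # instantiate, just like GlobalVectors$new() later on
counter$add(2)$add(3)     # methods are called directly on the object
counter$value
## [1] 5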

# Load required packages

library(data.table)
library(Matrix)
library(quanteda)
library(text2vec)

Subset corpus

We start by reading the data we have created during the topic modeling process:

# path <- ... (individual file location)

twitter_corpus <- readRDS(sprintf("%s/twitter_corpus_with_topics.RDS", path))

Next, we subset the corpus by topic label:

# Recode NA labels

twitter_corpus$topic_label <- ifelse(
  is.na(twitter_corpus$topic_label),
  99, 
  twitter_corpus$topic_label)

# Subset corpus

twitter_corpus_subsets <- lapply(
  unique(twitter_corpus$topic_label),
  function(i) quanteda::corpus_subset(twitter_corpus, topic_label == i))

sapply(twitter_corpus_subsets, quanteda::ndoc)
## [1] 498 237 423  57

Define function to compute embeddings

We will now define the embedding procedure and encapsulate it in a function we can then apply to each corpus subset. It is typically advisable to break such routines down into sub-functions (top-down programming). This keeps the code readable and makes it much easier to debug.

Our embedding function should have the following structure (regard this as pseudo-code):

make_glove_embeddings <- function(text, stopwords, glove_args) {
  
  tokens <- tokenize(text, stopwords)
  vocabulary <- make_vocab(tokens, glove_args)
  glove <- make_glove_instance(vocabulary, glove_args)
  word_vectors <- get_word_vectors(tokens, glove, glove_args)
  doc_embeddings <- get_doc_embeddings(tokens, word_vectors)
  
  doc_embeddings
  
}

Tokenization feels pretty been-there-done-that by now:

tokenize <- function(text, stopwords) {
  
  tokens <- quanteda::tokens(
    text,
    what = "word",
    remove_symbols = TRUE,
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_separators = TRUE,
    split_hyphens = TRUE,
    include_docvars = TRUE)
  
  tokens <- quanteda::tokens_wordstem(tokens, language = "german")
  
  tokens <- quanteda::tokens_remove(
    quanteda::tokens_tolower(tokens),
    pattern = stopwords)
  
  tokens <- quanteda::tokens_select(tokens, min_nchar = 3)
  
  tokens
  
}

stopwords <- readRDS(sprintf("%s/stopwords.RDS", path))
test_tokens <- tokenize(twitter_corpus, stopwords)
test_tokens[1:2]
## Tokens consisting of 2 documents and 20 docvars.
## ABaerbock15156546001 :
##  [1] "gro"             "sondierungspapi" "klima"           "nochmal"        
##  [5] "genau"           "angeschaut"      "krass"           "facto"          
##  [9] "sogar"           "kyoto"           "protokoll"       "meilenstein"    
## [ ... and 9 more ]
## 
## ABaerbock15210084001 :
##  [1] "weltweit"    "sieht"       "angela"      "merkel"      "groko"      
##  [6] "nicht"       "klimataug"   "sond"        "vielmehr"    "paris"      
## [11] "abkomm"      "unterlaeuft"
## [ ... and 5 more ]

The next step is to create a vocabulary. This is where text2vec first comes into play:

make_vocab <- function(tokens, term_count_min) {
  
  # Convert tokens to list
  
  tokens <- as.list(tokens)
  
  # Create iterator (an iterable object from which the vocabulary can be created)
  
  itokens <- text2vec::itoken(tokens, progressbar = FALSE)
  
  # Create vocabulary and retain only words of a certain frequency
      
  vocab <- text2vec::create_vocabulary(itokens)
  vocab <- text2vec::prune_vocabulary(vocab, term_count_min = term_count_min)
  
  list(itokens = itokens, vocab = vocab)
  
}

test_vocab <- make_vocab(test_tokens, term_count_min = 2)
summary(test_vocab$vocab)
##      term             term_count        doc_count      
##  Length:2913        Min.   :  2.000   Min.   :  1.000  
##  Class :character   1st Qu.:  2.000   1st Qu.:  2.000  
##  Mode  :character   Median :  3.000   Median :  3.000  
##                     Mean   :  6.332   Mean   :  6.038  
##                     3rd Qu.:  6.000   3rd Qu.:  6.000  
##                     Max.   :442.000   Max.   :358.000

Let’s now create an instance of the GloVe class (if you are not familiar with object-oriented programming, imagine this as some kind of abstract machine we can kick-start to compute our embeddings).

This requires us to specify some additional arguments:

  • the width of the window for neighboring words skip_grams_window, which we will later set rather small as our texts are super short,
  • the number of embedding dimensions dimension, arguably the most important hyperparameter, and
  • the co-occurrence count x_max at which GloVe's internal weighting function levels off (not that critical for us, as only very few of our words co-occur more than a handful of times).
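
For reference, this is the weighting function from the original GloVe paper: co-occurrence counts below x_max are down-weighted, everything at or above x_max receives full weight 1 (the exponent is alpha = 3/4 in the paper; text2vec exposes it as the alpha argument of GlobalVectors$new()):

\[ f(x) = \begin{cases} \left( x / x_{max} \right)^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \]
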
make_glove_instance <- function(vocabulary, skip_grams_window, dimension, x_max) {
  
  vect <- text2vec::vocab_vectorizer(vocabulary$vocab)
  
  tcm <- text2vec::create_tcm(
    vocabulary$itokens, 
    vect, 
    skip_grams_window = skip_grams_window) 
  
  glove_instance <- text2vec::GlobalVectors$new(rank = dimension, x_max = x_max)
  
  list(tcm = tcm, glove_instance = glove_instance)
  
}

test_model <- make_glove_instance(
  test_vocab, 
  skip_grams_window = 5, 
  dimension = 3, 
  x_max = 5)

With this somewhat hard-to-grasp container we can fit the GloVe (no bad jokes intended).

Again, there are some (rather technical) hyperparameters to set:

  • iterations – how many iterations do we allow for the optimization process?
  • convergence_tol – how large is our tolerance in declaring convergence (the process will terminate if either convergence is reached or the number of iterations is exhausted)?

Besides, we note that the final word vectors are the sum of the actual word vectors, here called wv_main, and the “context” vectors wv_context. The original paper is somewhat cryptic about the purpose of this double computation, but summing the two sets seems to reduce overfitting and noise (and should not make a difference in the worst case).

get_word_vectors <- function(tokens, glove, iterations, convergence_tol) {
  
  # Compute word vectors
  
  wv_main <- glove$glove_instance$fit_transform(
    glove$tcm, 
    n_iter = iterations, 
    convergence_tol = convergence_tol)  
  
  # Compute context vectors
  
  wv_context <- glove$glove_instance$components
  
  wv_main + t(wv_context)
  
}

test_word_vectors <- get_word_vectors(
  tokens = test_tokens,
  glove = test_model,
  iterations = 3,
  convergence_tol = 0.001)
## INFO  [17:18:21.112] epoch 1, loss 0.1226 
## INFO  [17:18:21.221] epoch 2, loss 0.0972 
## INFO  [17:18:21.258] epoch 3, loss 0.0907
test_word_vectors[sample(nrow(test_word_vectors), 5), ]
##                       [,1]        [,2]        [,3]
## unfassbar       -0.3863378  0.06020361 -0.02467850
## gez              0.3217785  0.13338595  0.09811976
## rundfunkgebuehr  0.4990832 -0.03229976  0.15888837
## jahr            -0.6718606  0.18209690  0.87011693
## immunitaet       0.1257666 -0.57761629 -0.44245804
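
As a quick, optional sanity check we could look at which words end up close together in the embedding space, e.g., via cosine similarity with text2vec::sim2(). Here, "klima" is just an example term (pick any word from the pruned vocabulary), and with only three embedding dimensions we should not expect overly meaningful neighbors:

# Optional: cosine similarity between one word and all other words
# ("klima" is an example; any term contained in the vocabulary works)
similarities <- text2vec::sim2(
  test_word_vectors,
  test_word_vectors["klima", , drop = FALSE],
  method = "cosine")

head(sort(similarities[, 1], decreasing = TRUE))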

This provides us with the embedding vectors per word, from which we calculate the embeddings per document by averaging the embeddings across all words present in the document (weighted by occurrence).

Mathematically speaking, we can achieve this by matrix multiplication:

  • First, get the document-term matrix counting each word’s occurrences per document (below is an example for 2 documents and 3 terms).

\[\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 2 \end{bmatrix}\]

  • Modify the entries so they sum to 1 in each row – these form the weights for our average.

\[\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix}\]

  • Multiply this weight matrix by the matrix of word vectors; each row of the result is then the weighted average of the word vectors occurring in that document.

\[ \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix} \times \begin{bmatrix} 0.34 & 0.11 & 0.52 \\ 0.25 & 0.08 & 0.64 \\ 0.02 & 0.77 & 0.15 \end{bmatrix} \\ = \begin{bmatrix} \frac{1}{2} \cdot 0.34 + \frac{1}{2} \cdot 0.25 + 0 \cdot 0.02 & \frac{1}{2} \cdot 0.11 + \frac{1}{2} \cdot 0.08 + 0 \cdot 0.77 & \frac{1}{2} \cdot 0.52 + \frac{1}{2} \cdot 0.64 + 0 \cdot 0.15 \\ 0 \cdot 0.34 + \frac{1}{3} \cdot 0.25 + \frac{2}{3} \cdot 0.02 & 0 \cdot 0.11 + \frac{1}{3} \cdot 0.08 + \frac{2}{3} \cdot 0.77 & 0 \cdot 0.52 + \frac{1}{3} \cdot 0.64 + \frac{2}{3} \cdot 0.15 \end{bmatrix} \]

  • … yielding the matrix of documents x embedding dimensions:

\[\begin{bmatrix} 0.295 & 0.095 & 0.580 \\ 0.097 & 0.540 & 0.313 \end{bmatrix}\]
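
For those who like to double-check, here is the same toy calculation in R (the numbers are the made-up ones from above, not from our Twitter data):

# Toy document-term matrix (2 documents x 3 terms)
dtm_toy <- matrix(c(1, 1, 0,
                    0, 1, 2), nrow = 2, byrow = TRUE)

# Row-normalize so that each row sums to 1
weights_toy <- dtm_toy / rowSums(dtm_toy)

# Toy word vectors (3 terms x 3 embedding dimensions)
word_vectors_toy <- matrix(c(0.34, 0.11, 0.52,
                             0.25, 0.08, 0.64,
                             0.02, 0.77, 0.15), nrow = 3, byrow = TRUE)

# Weighted average per document via matrix multiplication
round(weights_toy %*% word_vectors_toy, 3)
##       [,1]  [,2]  [,3]
## [1,] 0.295 0.095 0.580
## [2,] 0.097 0.540 0.313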

get_doc_embeddings <- function(tokens, word_vectors) {

  # Compute document-term matrix
  
  dtm <- quanteda::dfm_match(
    quanteda::dfm(tokens),
    rownames(word_vectors))
  
  # Modify so rows sum to 1
  
  dtm <- text2vec::normalize(dtm, norm = "l1")
  
  # Perform matrix multiplication
  
  doc_embeddings <- as.matrix(dtm) %*% word_vectors
  
  # Append doc_id info
  
  doc_embeddings <- data.table::as.data.table(doc_embeddings)
  doc_embeddings[, doc_id := tokens$doc_id]
      
  doc_embeddings
  
}

test_doc_embeddings <- get_doc_embeddings(test_tokens, test_word_vectors)

test_doc_embeddings[sample(nrow(test_doc_embeddings), 5), ]
##              V1          V2         V3                     doc_id
## 1:  0.001636491 -0.07232625 0.14960508 Oliver_Krischer15546483601
## 2:  0.074788185 -0.09397858 0.19964961    sven_kindler15810591001
## 3: -0.159996504 -0.04688173 0.28462818    danielakolbe15812392801
## 4:  0.009321571 -0.01631489 0.09157415     AndrejHunko15662928601
## 5: -0.214895043  0.02756950 0.20538623       tpflueger15647325601

Compute embeddings

We now collect our sub-routines in one large function we apply to each of our corpus subsets:

make_glove_embeddings <- function(corpus, 
                                  stopwords, 
                                  term_count_min,
                                  skip_grams_window, 
                                  dimension,
                                  x_max,
                                  iterations, 
                                  convergence_tol) {
  
  tokens <- tokenize(corpus, stopwords)
  vocabulary <- make_vocab(tokens, term_count_min)
  glove <- make_glove_instance(vocabulary, skip_grams_window, dimension, x_max)
  word_vectors <- get_word_vectors(tokens, glove, iterations, convergence_tol)
  doc_embeddings <- get_doc_embeddings(tokens, word_vectors)
  
  doc_embeddings
  
}

embeddings <- lapply(
  twitter_corpus_subsets,
  function(i) {make_glove_embeddings(
    i,
    stopwords = stopwords,
    term_count_min = 2,
    skip_grams_window = 5,
    dimension = 3,
    x_max = 10,
    iterations = 10,
    convergence_tol = 0.001)})

And lastly, we store all embedding vectors in one data.table that we can use for sentiment analysis. Structurally, it looks like this:

\[\begin{bmatrix} \text{embeddings for topic 1} & 0 & 0 & 0\\ 0 & \text{embeddings for topic 2} & 0 & 0 \\ 0 & 0 & \text{embeddings for topic 3} & 0 \\ 0 & 0 & 0 & \text{embeddings for topic 4} \end{bmatrix}\]
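
To see what Matrix::bdiag() does for us here, consider a toy illustration with two small made-up blocks (they have nothing to do with our embeddings):

# Two small blocks ...
block_1 <- matrix(1, nrow = 2, ncol = 2)
block_2 <- matrix(2, nrow = 1, ncol = 3)

# ... stacked into one block-diagonal matrix (sparse, hence as.matrix for display)
as.matrix(Matrix::bdiag(block_1, block_2))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    0    0    0
## [2,]    1    1    0    0    0
## [3,]    0    0    2    2    2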

# Save doc_id for the later join, then drop it so only embedding columns remain
      
doc_id <- unlist(lapply(embeddings, function(i) i$doc_id))
invisible(lapply(embeddings, function(i) i[, doc_id := NULL]))

# Create block matrix where each document's embedding loadings depend
# on its topic label (the doc_id column was saved and removed above)

embeddings <- lapply(embeddings, as.matrix)

embedding_matrix <- do.call(Matrix::bdiag, embeddings)

# Convert back to data.table

embeddings_dt <- data.table::as.data.table(as.matrix(embedding_matrix))
data.table::setnames(
  embeddings_dt,
  sprintf("embedding_%d", seq_along(embeddings_dt)))

embeddings_dt[, doc_id := ..doc_id]

# Inspect

embeddings_dt[sample(embeddings_dt[, .I], 10)]
##     embedding_1 embedding_2 embedding_3 embedding_4 embedding_5 embedding_6
##  1:    0.000000   0.0000000  0.00000000  0.05482503  0.06856164 -0.01412917
##  2:    0.000000   0.0000000  0.00000000 -0.08276987 -0.05781258  0.04739306
##  3:    0.000000   0.0000000  0.00000000  0.00000000  0.00000000  0.00000000
##  4:    0.000000   0.0000000  0.00000000  0.00000000  0.00000000  0.00000000
##  5:   -0.109036   0.2577165  0.05673243  0.00000000  0.00000000  0.00000000
##  6:    0.000000   0.0000000  0.00000000 -0.17533056  0.08600875 -0.20579785
##  7:    0.000000   0.0000000  0.00000000  0.00000000  0.00000000  0.00000000
##  8:   -0.371374   0.4012897 -0.04043765  0.00000000  0.00000000  0.00000000
##  9:    0.000000   0.0000000  0.00000000 -0.27099564  0.11642943  0.21945584
## 10:    0.000000   0.0000000  0.00000000  0.01614947 -0.01226427 -0.16079251
##     embedding_7 embedding_8 embedding_9 embedding_10 embedding_11 embedding_12
##  1:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
##  2:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
##  3:  0.08121451  -0.1553581  -0.0612087    0.0000000    0.0000000   0.00000000
##  4:  0.00000000   0.0000000   0.0000000    0.0198986   -0.1740356   0.03903582
##  5:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
##  6:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
##  7:  0.05419094   0.1526954  -0.1800756    0.0000000    0.0000000   0.00000000
##  8:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
##  9:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
## 10:  0.00000000   0.0000000   0.0000000    0.0000000    0.0000000   0.00000000
##                         doc_id
##  1:         hahnflo15968008801
##  2:   BriHasselmann15670225801
##  3:    katjakipping15542043601
##  4: karl_lauterbach15973945201
##  5:        kleikert15421870801
##  6:    IreneMihalic16021797601
##  7:       JM_Luczak15713945401
##  8:         NordMdb15880587601
##  9: JuergenBraunAfD15845533201
## 10:     victorperli15914375401

# Insert info back into corpus

twitter_docvars <- data.table::as.data.table(
  cbind(
    doc_id = quanteda::docid(twitter_corpus),
    quanteda::docvars(twitter_corpus)), 
  key = "doc_id")

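# Right join in data.table syntax: keep all rows of twitter_docvars and
# attach the matching embedding columns (rows without a match would get NA)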
twitter_docvars <- embeddings_dt[twitter_docvars, on = "doc_id"]
twitter_corpus_with_embeddings <- twitter_corpus

quanteda::docvars(twitter_corpus_with_embeddings) <- as.data.frame(twitter_docvars)

# Save

saveRDS(
  twitter_corpus_with_embeddings, 
  sprintf("%s/twitter_corpus_with_embeddings.RDS", path))

And that’s it! Now we have topic-specific embedding vectors we can use as features in the upcoming sentiment analysis.