Train Sentiment Analysis Classifier and Compute Prediction

This demo is about training a classifier for sentiment analysis and predicting sentiment labels.

We show how to perform sentiment analysis with machine learning methods. More specifically, we will use a logistic regression model to predict sentiment labels in a binary classification task.

The code is based on the mlr3 universe, a unifying framework for a broad variety of tasks in supervised learning that subsumes many other packages and provides consistent syntax. This book provides a great introduction to mlr3. Further examples with applied use cases may be found here.

Note that it is built in R6. Covering object-oriented programming (OOP) is beyond the scope of this course but we will mention some basics. The most important things to know here are:

# Load required packages

library(data.table)
library(quanteda)
library(mlr3verse)

Perform train-test split

As always, we start by reading the data. Now there is one big BUT: we cannot simply take the data with topic embeddings as created in the last demo – if we used these and splitted them into train and test sets, we would see information from the training observations leak into the test part. That is why we called topic labels and embeddings dynamic features.

We want to avoid such bias and proceed as follows:

  1. We split the data without topic labels / embeddings into training and test data.
  2. We extract topics and compute topic-specific embeddings for our training data, the results of which we can then apply to our test data.

We first read the static features:

# path <- ... (individual file location)

twitter_features_static <- readRDS(sprintf("%s/twitter_features_static.RDS", path))

Next, we perform a train-test split:

# Create a binary variable indicating whether the document belongs to training or test data (setting a seed for reproducibility and choosing 70% of data for training)

set.seed(123)

twitter_features_static$is_train <- rbinom(
  n = nrow(twitter_features_static), 
  size = 1, 
  prob = 0.7)

# Convert target variable to a factor

twitter_features_static$label <- as.factor(twitter_features_static$label)

# Perform split

data_train <- twitter_features_static[is_train == 1]
data_test <- twitter_features_static[is_train == 0]

nrow(data_train) + nrow(data_test) == nrow(twitter_features_static)
## [1] TRUE
# Inspect

list(train = head(data_train), test = head(data_test))
## $train
##                  doc_id    label ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART
## 1: ABaerbock15156546001 negative   4   1   3   2     0   4    0    8   0    1
## 2: ABaerbock15216341401 negative   4   4   2   1     0   6    0   10   2    2
## 3: ABaerbock15283779001 negative   2   4   2   2     0   4    0    8   0    2
## 4: ABaerbock15372868201 negative   3   4   8   0     1   5    0    9   0    2
## 5: ABaerbock15485148001 positive   3   4   2   2     1   3    0   10   1    0
## 6: ABaerbock15514725001 positive   0   3   0   1     2   6    0   10   0    0
##    PRON PROPN PUNCT SCONJ VERB X negation  a b  c  d  e f  g  h  i j k  l  m  n
## 1:    2     2     7     0    3 2        1 19 2  6  9 22 3  8  8 14 0 9 10  7 11
## 2:    3     0     7     0    4 0        0 15 1  3 13 40 5 10  4 18 0 6 15  7 17
## 3:    5     3     6     0    9 0        1 20 6  6  6 35 7 12  9 18 0 2 10 10 19
## 4:    4     7     5     0    5 0        1 15 4  9  7 33 3 11 19 15 1 4  5 10 18
## 5:    2     1     4     1    3 0        0 16 4 11  7 32 8  2 12 18 0 2  8  8 15
## 6:    3     6     4     0    3 0        0 17 5  4  5 29 8  4  4 13 0 4  6  6 16
##     o p q  r  s  t  u v w x y z n_emojis positive negative is_train
## 1: 14 4 0 16 15 11  7 0 1 0 1 1        0        3        3        1
## 2: 10 1 0 21 20 13 11 2 2 0 0 2        0        3        4        1
## 3:  3 2 0 11 14 16 12 1 4 0 0 4        0        2        2        1
## 4: 13 0 0 16 14 11 11 2 3 0 1 2        0        8        4        1
## 5:  7 0 0 17 18 17  9 1 5 0 3 3        0        2        1        1
## 6:  4 5 0 15 20 14 12 1 4 1 0 1        0        4        3        1
## 
## $test
##                  doc_id    label ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART
## 1: ABaerbock15210084001 negative   5   0   5   1     1   1    0    2   1    1
## 2: ABaerbock15252349801 negative   5   4   3   1     2   4    0    9   0    1
## 3: ABaerbock15256297201 positive   2   4   1   5     3   2    0   10   0    0
## 4: ABaerbock15435904201 negative   2   3   0   0     2   2    0    7   0    0
## 5: ABaerbock15587025001 positive   4   2   1   1     1   3    0   11   1    0
## 6: ABaerbock15984426601 negative   5   4   0   2     1   5    0    8   1    1
##    PRON PROPN PUNCT SCONJ VERB X negation  a b c  d  e f g  h  i j k  l m  n o
## 1:    1     4     5     1    3 1        1 19 1 5  4 18 4 5  8 11 1 5 11 9  9 8
## 2:    5     0     6     0    3 0        1 10 4 6 10 38 2 4 11 22 0 4  8 3 24 7
## 3:    5     2     6     1    4 0        0 13 4 2 14 29 1 4  8 19 0 8  8 8 24 6
## 4:    3     3     5     1    3 0        0  9 2 7 12 27 0 5  7 12 0 5  7 4 18 6
## 5:    1     7     3     0    0 0        0 18 2 3 11 24 5 5  4  9 2 4 10 4 10 6
## 6:    3     1     4     3    5 0        1 12 2 5 11 40 4 6  9 16 0 5  8 8 20 9
##    p q  r  s  t  u v w x y z n_emojis positive negative is_train
## 1: 3 0 11 12 13  7 1 2 0 0 1        0        1        2        0
## 2: 1 0 16 15 11  8 3 3 0 0 3        0        3        5        0
## 3: 3 0  8 15 11  9 1 8 0 0 1        0        3        0        0
## 4: 2 0  8 12  5  7 1 1 0 0 2        0        0        1        0
## 5: 2 0 12 12 13 12 1 3 0 1 2        0        5        1        0
## 6: 2 0 13 18 18  7 2 3 0 0 2        0        1        2        0

Compute embeddings

Now we compute the topic-specific embeddings. For the sake of clarity we have omitted the actual computation for this demo as things become a bit lengthy at this point code-wise.

The basic idea is to compute topic labels and embeddings for the training data, just as we have seen before (using the same number of topics and dimensions, respectively), and storing the STM and embedding word vectors such that they can be used for prediction at test time. If you are interested in how this implemented, have a look at the source code of this demo.

embedding_values_train <- ...
embedding_values_test <- ...
# Collect all static and dynamic features

data_train <- embedding_values_train[data_train, on = "doc_id"]
data_test <- embedding_values_test[data_test, on = "doc_id"]

Create classification task

Now we are all set for sentiment analysis. The first step is to create a task object:

# Create task object

task <- mlr3::TaskClassif$new("sentiment_analysis", data_train, target = "label")

# Inspect

task
## <TaskClassif:sentiment_analysis> (858 x 61)
## * Target: label
## * Properties: twoclass
## * Features (60):
##   - dbl (57): ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART,
##     PRON, PROPN, PUNCT, SCONJ, VERB, X, a, b, c, d, e, embedding_1,
##     embedding_10, embedding_11, embedding_12, embedding_2, embedding_3,
##     embedding_4, embedding_5, embedding_6, embedding_7, embedding_8,
##     embedding_9, f, g, h, i, j, k, l, m, n, negation, negative, o, p,
##     positive, q, r, s, t, u, v, w, x, y, z
##   - int (2): is_train, n_emojis
##   - chr (1): doc_id

Note that we need to make sure the doc_id variable, which is not a feature but merely an identifier, is exempt from the features.

# Set the role of doc_id to naming variable

task$set_col_roles("doc_id", "name")

Create and train learner

The learner reflects our hypothesis about the feature-target relation.

mlr3 supports a variety of different learning algorithms (convenient look-up via mlr_learners) and even more are available in the mlr3learners extension package. All are instantiated by the same call.

We choose logistic regression here, a generalized version of the linear regression model that is able to predict class probabilities (if the concept seems somewhat distant in your memory, have a look into one of the standard statistics books, or into this one, or check out one of the abundant blogposts and YouTube sources).

The logistic regression model can be stated in different ways. We will often see something of the form

\[\pi({x}_i) = \mathbb{P}(y_i = 1 \rvert {x}_i) = \frac{\exp(\beta_0 + \mathbf{\beta}^T \mathbf{x}_i)}{1 + \exp(\beta_0 + \mathbf{\beta}^T \mathbf{x}_i)} = \frac{\exp(\beta_0 + \beta_1 x_{i, 1} + \dots + \beta_p x_{i, p})}{1 + \exp(\beta_0 + \beta_1 x_{i, 1} + \dots + \beta_p x_{i, p})},\]

specifying the probability of being in class 1 (here: positive class) for a given observation \(\mathbf{x}_i\).

We can see how the so-called log-odds are related to the linear predictor:

\[\beta_0 + \mathbf{\beta}^T \mathbf{x}_i = \log \frac{\pi(\mathbf{x}_i)}{1 - \pi(\mathbf{x}_i)}.\]

So we do not directly model the conditional mean, as in the standard linear model, but a transformed version of it (hence generalized linear modeling). Logistic regression does not require any hyperparameters, allowing us to skip the tuning part, and we can simply invoke a logistic regression learner by:

# Create learner

learner <- mlr3::lrn("classif.log_reg", predict_type = "prob")

learner
## <LearnerClassifLogReg:classif.log_reg>
## * Model: -
## * Parameters: list()
## * Packages: stats
## * Predict Type: prob
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: twoclass, weights

After instantiating the learner object, we do the actual training using our training task. Note how the learner is modified in-place:

# Train learner on training task

learner$train(task)

Evaluate learner

Now is the time for our model to output some predictions on the test set so we can evaluate its performance.

# Compute predictions

data_test$label <- as.factor(data_test$label)
predictions <- learner$predict_newdata(data_test)

# Inspect confusion matrix

list(predictions = predictions, confusion = predictions$confusion)
## $predictions
## <PredictionClassif> for 357 observations:
##     row_ids    truth response prob.negative prob.positive
##           1 negative negative     0.9719697    0.02803029
##           2 negative negative     0.9741098    0.02589016
##           3 positive negative     0.6142915    0.38570852
## ---                                                      
##         355 negative negative     0.9245560    0.07544396
##         356 negative negative     0.8200882    0.17991179
##         357 negative negative     0.9807699    0.01923006
## 
## $confusion
##           truth
## response   negative positive
##   negative      225       37
##   positive       27       68

Our learner looks like it’s doing pretty okay but not extremely accurate (it seems to struggle with the positive observations in particular). Let’s compute some performance metrics (list all available metrics with mlr_measures) to get a better picture:

eval_metrics <- list(
  mlr3::msr("classif.acc"), # accuracy
  mlr3::msr("classif.ppv"), # positive predictive value
  mlr3::msr("classif.fbeta") # f1 score
)

predictions$score(eval_metrics)
##   classif.acc   classif.ppv classif.fbeta 
##     0.8207283     0.8587786     0.8754864

We can also plot the associated ROC curve:

mlr3viz::autoplot(predictions, type = "roc") +
  ggplot2::ggtitle("ROC curve for Twitter sentiment analysis")

All in all, we seem to have found a fairly decent model. It does not exactly achieve exciting performance, but the task at hand is actually a hard one (keep in mind that we try to predict the author’s sentiment by such simple things as the amount of certain letters in the text).

In a real application model selection does not stop here; we would compare and benchmark multiple learners against each other to find the best one for our task. For now, though, we stick with our logistic regression model.

Fit final model

By splitting the data into train and test sets we deliberately forgo parts of the data for training (rule of thumb: more data, better model) so we had some spare observations for evaluation. This allowed us to obtain a (slightly pessimistic) performance estimation.

In the end we would typically train the learner on the entire data set and use the resulting, final model for predictions of future unseen data:

# Create task object

data_total <- rbind(data_train, data_test)
task_total <- mlr3::TaskClassif$new("sentiment_analysis_final", data_total, target = "label")
task_total$set_col_roles("doc_id", "name")

# Train learner

learner$train(task_total)

# Store model

final_model <- learner$model

# Inspect

summary(final_model)
## 
## Call:
## stats::glm(formula = task$formula(), family = "binomial", data = task$data(), 
##     model = FALSE)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3502  -0.5556  -0.2677   0.3323   3.0378  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.736355   0.457862   1.608  0.10778    
## ADJ            0.154547   0.074519   2.074  0.03809 *  
## ADP           -0.004500   0.065215  -0.069  0.94499    
## ADV           -0.146338   0.062848  -2.328  0.01989 *  
## AUX           -0.059479   0.089551  -0.664  0.50657    
## CCONJ          0.123904   0.095774   1.294  0.19577    
## DET            0.089241   0.056785   1.572  0.11605    
## INTJ         -12.975043 548.013772  -0.024  0.98111    
## NOUN           0.038586   0.068580   0.563  0.57368    
## NUM            0.129087   0.144377   0.894  0.37127    
## PART          -0.125655   0.165315  -0.760  0.44720    
## PRON           0.225313   0.069694   3.233  0.00123 ** 
## PROPN         -0.060907   0.060879  -1.000  0.31708    
## PUNCT         -0.036091   0.064347  -0.561  0.57488    
## SCONJ          0.230787   0.159162   1.450  0.14705    
## VERB          -0.135904   0.075603  -1.798  0.07224 .  
## X              0.171143   0.160072   1.069  0.28500    
## a             -0.011001   0.031026  -0.355  0.72291    
## b              0.015416   0.050982   0.302  0.76236    
## c             -0.024741   0.056392  -0.439  0.66086    
## d             -0.067081   0.038529  -1.741  0.08168 .  
## e             -0.024380   0.022452  -1.086  0.27755    
## embedding_1    1.695805   3.279168   0.517  0.60505    
## embedding_10  -2.722780   2.462974  -1.105  0.26895    
## embedding_11   0.061367   1.785801   0.034  0.97259    
## embedding_12   1.889366   1.805044   1.047  0.29523    
## embedding_2   -0.882042   3.500844  -0.252  0.80108    
## embedding_3   -4.979378   2.923026  -1.704  0.08847 .  
## embedding_4   -2.456096   1.272871  -1.930  0.05366 .  
## embedding_5   -1.524548   0.893319  -1.707  0.08789 .  
## embedding_6    0.425659   0.889883   0.478  0.63241    
## embedding_7    1.351531   0.868545   1.556  0.11969    
## embedding_8    0.489600   0.494841   0.989  0.32246    
## embedding_9    0.919663   0.837002   1.099  0.27187    
## f             -0.008689   0.042763  -0.203  0.83898    
## g              0.002247   0.037443   0.060  0.95215    
## h              0.047814   0.047952   0.997  0.31871    
## i             -0.026396   0.028558  -0.924  0.35533    
## is_train      -0.019773   0.190301  -0.104  0.91725    
## j              0.116936   0.145996   0.801  0.42316    
## k              0.046182   0.054554   0.847  0.39725    
## l             -0.046684   0.034857  -1.339  0.18047    
## m             -0.080040   0.039583  -2.022  0.04317 *  
## n              0.008490   0.025334   0.335  0.73753    
## n_emojis       0.347958   0.189122   1.840  0.06579 .  
## negation      -0.056512   0.199316  -0.284  0.77677    
## negative      -0.744346   0.081880  -9.091  < 2e-16 ***
## o             -0.026371   0.040693  -0.648  0.51696    
## p             -0.014674   0.060100  -0.244  0.80711    
## positive       0.488522   0.053121   9.196  < 2e-16 ***
## q             -0.806195   0.402424  -2.003  0.04514 *  
## r             -0.003347   0.028822  -0.116  0.90755    
## s             -0.049773   0.026108  -1.906  0.05660 .  
## t              0.034772   0.026910   1.292  0.19630    
## u              0.000367   0.034832   0.011  0.99159    
## v             -0.047429   0.072671  -0.653  0.51398    
## w             -0.012511   0.052807  -0.237  0.81272    
## x              0.134343   0.223279   0.602  0.54739    
## y              0.227321   0.188087   1.209  0.22682    
## z              0.051697   0.062476   0.827  0.40797    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1430.94  on 1214  degrees of freedom
## Residual deviance:  868.97  on 1155  degrees of freedom
## AIC: 988.97
## 
## Number of Fisher Scoring iterations: 13

And that’s it for a first jump at sentiment analysis! We have performed a train-test split of our data, trained a logistic regression learner on the training task, used the resulting model to compute predictions for the test data to evaluate performance, and trained the final model we could now use for future sentiment analysis.