Extracting Static Features

This demo is about static feature extraction.

We show how to extract several features deemed useful for sentiment analysis. Static here means that these features are computed independently across documents (unlike, for instance, word embeddings).

# Load required packages

library(data.table)
library(quanteda)
library(stringi)
library(stringr)

Scope

We will extract a set of static features that have been shown to work well for sentiment analysis in general, including:

  • Polarity clues
  • Number of emojis
  • Negations
  • Character unigrams
  • POS tags

Create tokens object

The following analyses all revolve around the presence/absence or number of specific text tokens across documents. Therefore, we convert our corpus object from before into a tokens object, in which documents are represented by individual tokens rather than running text, and convert everything to lowercase along the way.

# path <- ... (individual file location)

twitter_corpus <- readRDS(sprintf("%s/twitter_corpus.RDS", path))
twitter_tokens <- quanteda::tokens(
  twitter_corpus,
  what = "word", # take tokens to be words
  remove_symbols = TRUE, # in case we forgot anything earlier
  remove_numbers = TRUE, # not relevant
  remove_separators = TRUE, # not relevant
  remove_punct = TRUE, # not relevant
  split_hyphens = TRUE, # sometimes useful with German language
  include_docvars = TRUE) # keep additional variables

twitter_tokens <- quanteda::tokens_tolower(twitter_tokens)

While we are at it, we also stem our tokens to increase their congruency across documents (remember, we want our documents to be represented by as many common tokens as possible). quanteda has built-in functionality for this, which in turn calls SnowballC’s wordStem function:

twitter_tokens <- quanteda::tokens_wordstem(
  twitter_tokens, 
  language = "german")

twitter_tokens
## Tokens consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
##  [1] "hab"             "mir"             "das"             "gro"            
##  [5] "ko"              "sondierungspapi" "zu"              "klima"          
##  [9] "nochmal"         "genau"           "angeschaut"      "krass"          
## [ ... and 22 more ]
## 
## ABaerbock15210084001 :
##  [1] "auch"      "weltweit"  "sieht"     "man"       "dass"      "angela"   
##  [7] "merkel"    "s"         "groko"     "nicht"     "klimataug" "ist"      
## [ ... and 13 more ]
## 
## ABaerbock15216341401 :
##  [1] "der"               "groesst"           "global"           
##  [4] "herausforder"      "der"               "klimakris"        
##  [7] "sek"               "von"               "std"              
## [10] "regierungserklaer" "zu"                "widm"             
## [ ... and 24 more ]
## 
## ABaerbock15252349801 :
##  [1] "wir"           "brauch"        "ein"           "and"          
##  [5] "verkehrspolit" "weg"           "von"           "benzin"       
##  [9] "und"           "diesel"        "hin"           "zu"           
## [ ... and 25 more ]
## 
## ABaerbock15256297201 :
##  [1] "das"        "ist"        "die"        "leistung"   "von"       
##  [6] "hundert"    "wahlkaempf" "und"        "polit"      "im"        
## [11] "land"       "die"       
## [ ... and 27 more ]
## 
## ABaerbock15283779001 :
##  [1] "luis"            "amtsberg"        "hat"             "heut"           
##  [5] "all"             "gesagt"          "was"             "es"             
##  [9] "zum"             "familiennachzug" "zu"              "sag"            
## [ ... and 28 more ]
## 
## [ reached max_ndoc ... 1,209 more documents ]

We quickly spot some tokens that do not seem to be of much help (e.g., “zu”). This is where stopwords come in. We define a list of such stopwords and have quanteda remove them from our tokens.

However, we need to account for the fact that we have done quite a bit of text manipulation so far – in particular, we have removed umlauts from our tokens and reduced them to their word stem. In order to make sure our stopwords are matched correctly, we should give them the same treatment.

In fact, we will do this cleaning-stemming transformation more than once, so we encapsulate it in a small function we can call again later:

clean_and_stem <- function(text) {
  
  # Convert to uniform encoding
  
  text <- stringi::stri_trans_general(text, "Any-Latin")
  
  # Replace umlauts and the ligature s (Eszett)
  
  text <- stringr::str_replace_all(
    text,
    c("\u00c4" = "Ae",
      "\u00e4" = "ae",
      "\u00d6" = "Oe",
      "\u00f6" = "oe",
      "\u00dc" = "Ue",
      "\u00fc" = "ue",
      "\u00df" = "ss"))
  
  # Stem
  
  text <- SnowballC::wordStem(text)
  
  text

}

clean_and_stem(c("Döner", "Bereicherung", "kulinarischer", "Angebote"))
## [1] "Doener"       "Bereicherung" "kulinarisch"  "Angebot"

Seems to do the job (but also note the limitations of stemming).
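
To see one such limitation, note that stemming is purely rule-based: irregular forms are not mapped to a common stem, and umlauted plurals remain distinct from their singulars after our character replacement. A quick illustrative check (the example words are ours, not from the data):

# Irregular or umlauted forms are not unified: "gut"/"besser" and
# "Haus"/"Haeuser" end up with distinct stems
clean_and_stem(c("gut", "besser", "Haus", "Haeuser"))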

We create a clean stopwords list, which we again wrap in a function we can re-use:

# Create stopwords list

stopwords <- clean_and_stem(quanteda::stopwords(language = "de"))

# Remove potential duplicates and empty instances induced by stemming

stopwords <- unique(stopwords)
stopwords <- stopwords[nchar(stopwords) > 0]

# Inspect

sort(stopwords)
##   [1] "aber"      "al"        "all"       "allem"     "allen"     "aller"    
##   [7] "also"      "am"        "an"        "ander"     "anderem"   "anderen"  
##  [13] "anderm"    "andern"    "anderr"    "au"        "auch"      "auf"      
##  [19] "bei"       "bi"        "bin"       "bist"      "da"        "damit"    
##  [25] "dann"      "dass"      "dasselb"   "dazu"      "de"        "dein"     
##  [31] "deinem"    "deinen"    "deiner"    "dem"       "demselben" "den"      
##  [37] "denn"      "denselben" "der"       "derer"     "derselb"   "derselben"
##  [43] "desselben" "dessen"    "di"        "dich"      "die"       "dies"     
##  [49] "dieselb"   "dieselben" "diesem"    "diesen"    "dieser"    "dir"      
##  [55] "doch"      "dort"      "du"        "durch"     "e"         "ein"      
##  [61] "einem"     "einen"     "einer"     "einig"     "einigem"   "einigen"  
##  [67] "einmal"    "er"        "etwa"      "euch"      "euer"      "eur"      
##  [73] "eurem"     "euren"     "eurer"     "fuer"      "gegen"     "gewesen"  
##  [79] "hab"       "habe"      "haben"     "hat"       "hatt"      "hatten"   
##  [85] "hier"      "hin"       "hinter"    "ich"       "ihm"       "ihn"      
##  [91] "ihnen"     "ihr"       "ihrem"     "ihren"     "ihrer"     "im"       
##  [97] "in"        "indem"     "ist"       "jede"      "jedem"     "jeden"    
## [103] "jeder"     "jene"      "jenem"     "jenen"     "jener"     "jetzt"    
## [109] "kann"      "kein"      "keinem"    "keinen"    "keiner"    "koennen"  
## [115] "koennt"    "machen"    "man"       "manch"     "manchem"   "manchen"  
## [121] "mancher"   "mein"      "meinem"    "meinen"    "meiner"    "mich"     
## [127] "mir"       "mit"       "muss"      "musst"     "nach"      "nicht"    
## [133] "noch"      "nun"       "nur"       "ob"        "oder"      "ohn"      
## [139] "sehr"      "sein"      "seinem"    "seinen"    "seiner"    "selbst"   
## [145] "sich"      "sie"       "sind"      "so"        "solch"     "solchem"  
## [151] "solchen"   "solcher"   "soll"      "sollt"     "sondern"   "sonst"    
## [157] "ueber"     "um"        "un"        "und"       "uns"       "unsem"    
## [163] "unsen"     "unser"     "unter"     "viel"      "vom"       "von"      
## [169] "vor"       "wa"        "waehrend"  "war"       "waren"     "warst"    
## [175] "weg"       "weil"      "weiter"    "welch"     "welchem"   "welchen"  
## [181] "welcher"   "wenn"      "werd"      "werden"    "wie"       "wieder"   
## [187] "will"      "wir"       "wird"      "wirst"     "wo"        "wollen"   
## [193] "wollt"     "wuerd"     "wuerden"   "zu"        "zum"       "zur"      
## [199] "zwar"      "zwischen"

Words indicating negation seem too important to the sentiment analysis task to discard. We strike them from the stopwords list and remove the remaining stopwords from our tokens:

# Deselect negation words

stopwords <- stopwords[!stringr::str_detect(
  stopwords,
  "^(kein|nicht)")]

# Remove from tokens

twitter_tokens <- quanteda::tokens_remove(twitter_tokens, stopwords)

# Inspect

twitter_tokens
## Tokens consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
##  [1] "das"             "gro"             "ko"              "sondierungspapi"
##  [5] "klima"           "nochmal"         "genau"           "angeschaut"     
##  [9] "krass"           "facto"           "sogar"           "das"            
## [ ... and 13 more ]
## 
## ABaerbock15210084001 :
##  [1] "weltweit"  "sieht"     "angela"    "merkel"    "s"         "groko"    
##  [7] "nicht"     "klimataug" "sond"      "vielmehr"  "das"       "paris"    
## [ ... and 7 more ]
## 
## ABaerbock15216341401 :
##  [1] "groesst"           "global"            "herausforder"     
##  [4] "klimakris"         "sek"               "std"              
##  [7] "regierungserklaer" "widm"              "unterstreicht"    
## [10] "groesst"           "leerstell"         "koalitionsvertrag"
## [ ... and 10 more ]
## 
## ABaerbock15252349801 :
##  [1] "brauch"        "and"           "verkehrspolit" "benzin"       
##  [5] "diesel"        "emissionsarm"  "mobilitaet"    "das"          
##  [9] "sozial"        "frag"          "unt"           "schlecht"     
## [ ... and 8 more ]
## 
## ABaerbock15256297201 :
##  [1] "das"        "leistung"   "hundert"    "wahlkaempf" "polit"     
##  [6] "land"       "drauss"     "gezeigt"    "dicht"      "was"       
## [11] "mensch"     "umtreibt"  
## [ ... and 8 more ]
## 
## ABaerbock15283779001 :
##  [1] "luis"            "amtsberg"        "heut"            "gesagt"         
##  [5] "was"             "es"              "familiennachzug" "sag"            
##  [9] "gibt"            "wer"             "integration"     "gefluechtet"    
## [ ... and 14 more ]
## 
## [ reached max_ndoc ... 1,209 more documents ]

Apparently, some very short word stumps result from our cleaning operations, which we will also discard:

twitter_tokens <- quanteda::tokens_select(twitter_tokens, min_nchar = 3)
twitter_tokens <- quanteda::tokens_remove(
  twitter_tokens, 
  c("der", "die", "das", "was", "wer", "wie", "ich", "sie", "wir", "ihr"))

twitter_tokens
## Tokens consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
##  [1] "gro"             "sondierungspapi" "klima"           "nochmal"        
##  [5] "genau"           "angeschaut"      "krass"           "facto"          
##  [9] "sogar"           "kyoto"           "protokoll"       "meilenstein"    
## [ ... and 9 more ]
## 
## ABaerbock15210084001 :
##  [1] "weltweit"    "sieht"       "angela"      "merkel"      "groko"      
##  [6] "nicht"       "klimataug"   "sond"        "vielmehr"    "paris"      
## [11] "abkomm"      "unterlaeuft"
## [ ... and 5 more ]
## 
## ABaerbock15216341401 :
##  [1] "groesst"           "global"            "herausforder"     
##  [4] "klimakris"         "sek"               "std"              
##  [7] "regierungserklaer" "widm"              "unterstreicht"    
## [10] "groesst"           "leerstell"         "koalitionsvertrag"
## [ ... and 9 more ]
## 
## ABaerbock15252349801 :
##  [1] "brauch"        "and"           "verkehrspolit" "benzin"       
##  [5] "diesel"        "emissionsarm"  "mobilitaet"    "sozial"       
##  [9] "frag"          "unt"           "schlecht"      "luft"         
## [ ... and 7 more ]
## 
## ABaerbock15256297201 :
##  [1] "leistung"   "hundert"    "wahlkaempf" "polit"      "land"      
##  [6] "drauss"     "gezeigt"    "dicht"      "mensch"     "umtreibt"  
## [11] "konkret"    "beweis"    
## [ ... and 6 more ]
## 
## ABaerbock15283779001 :
##  [1] "luis"            "amtsberg"        "heut"            "gesagt"         
##  [5] "familiennachzug" "sag"             "gibt"            "integration"    
##  [9] "gefluechtet"     "perspektiv"      "zusammenleb"     "famili"         
## [ ... and 9 more ]
## 
## [ reached max_ndoc ... 1,209 more documents ]

There is still a variety of tokens we could probably remove (more extensive stopword lists can be found online; the built-in one from quanteda is, frankly, not too good). In a real application you should dedicate some time to this sub-task to make sure your tokens representation is not contaminated by such noise.
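
As a sketch of what that could look like (assuming the stopwords package, from which quanteda takes its default list, is installed), a larger community-maintained list such as stopwords-iso could be plugged into the same pipeline:

# Sketch (not run as part of this demo): a more extensive stopword list,
# given the same cleaning/stemming treatment before removal
# (as above, you would again want to keep negation words out of this list)
stopwords_iso <- clean_and_stem(stopwords::stopwords("de", source = "stopwords-iso"))
stopwords_iso <- unique(stopwords_iso[nchar(stopwords_iso) > 0])
twitter_tokens_alt <- quanteda::tokens_remove(twitter_tokens, stopwords_iso)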

Create dfm object

Now that we have characterized our documents by tokens, we want to put some numbers on them: how often does each token occur in each document?

This is where the document-feature matrix (dfm) comes in:

twitter_dfm <- quanteda::dfm(twitter_tokens)

twitter_dfm
## Document-feature matrix of: 1,215 documents, 7,583 features (99.8% sparse) and 16 docvars.
##                       features
## docs                   gro sondierungspapi klima nochmal genau angeschaut krass
##   ABaerbock15156546001   1               1     1       1     1          1     1
##   ABaerbock15210084001   0               0     0       0     0          0     0
##   ABaerbock15216341401   0               0     0       0     0          0     0
##   ABaerbock15252349801   0               0     0       0     0          0     0
##   ABaerbock15256297201   0               0     0       0     0          0     0
##   ABaerbock15283779001   0               0     0       0     1          0     0
##                       features
## docs                   facto sogar kyoto
##   ABaerbock15156546001     1     1     1
##   ABaerbock15210084001     0     0     0
##   ABaerbock15216341401     0     0     0
##   ABaerbock15252349801     0     0     0
##   ABaerbock15256297201     0     0     0
##   ABaerbock15283779001     0     0     0
## [ reached max_ndoc ... 1,209 more documents, reached max_nfeat ... 7,573 more features ]
saveRDS(twitter_dfm, sprintf("%s/twitter_dfm.RDS", path))

Find polarity clues

Time for some feature extraction at last.

The first static feature we will extract is the number of (positive or negative) polarity clues in the data.

Polarity clues are sentiment-bearing words like “grauenhaft”. There is a variety of open-source lists for these; we will use the Global Polarity Clues collection available here (already somewhat processed to only include relevant columns).

# Get polarities

list_gpc <- readRDS(sprintf("%s/global_polarity_clues.RDS", path))

# Clean

list_gpc <- lapply(list_gpc, clean_and_stem)

# Inspect

str(list_gpc)
## List of 2
##  $ positive: chr [1:17627] "Abfangschirm" "Abgeklaertheit" "Abgeschlossenheit" "Abgleich" ...
##  $ negative: chr [1:19962] "Abandon" "Abart" "Abbau" "Abbaue" ...

quanteda offers dedicated dictionary objects that allow for convenient look-ups in dfm objects, so we convert the GPC lists to such a format (note that this operation also takes care of lowercasing):

dictionary_gpc <- quanteda::dictionary(
  list(
    positive = list_gpc$positive,
    negative = list_gpc$negative))

dictionary_gpc
## Dictionary object with 2 key entries.
## - [positive]:
##   - abfangschirm, abgeklaertheit, abgeschlossenheit, abgleich, abgott, abhilf, abhilfemassnahm, abkomm, abkommen, abkuehlung, ableger, abloesen, abloesung, abmachung, abmachungen, abschirmung, abschluss, abschlusss, abschluess, abschluessen [ ... and 11,514 more ]
## - [negative]:
##   - abandon, abart, abbau, abbaue, abbauen, abbildungsfehl, abbildverzerrung, abblendung, abbohrung, abbruch, abbruchstel, abbruech, abbruechen, abdampf, abdankung, abdankungen, abdaempfung, abdaempfungen, abenteuerlichkeit, abenteuerpolitik [ ... and 14,061 more ]

Perform the look-up:

# Look up polarities

twitter_polarities <- quanteda::dfm_lookup(twitter_dfm, dictionary_gpc)

# Convert resulting dfm object to data.table (via data.frame as quanteda does not yet support conversion to data.table)

twitter_polarities <- quanteda::convert(twitter_polarities, to = "data.frame")

twitter_polarities <- data.table::as.data.table(
  twitter_polarities,
  key = "doc_id")

# Inspect

twitter_polarities[sample(twitter_polarities[, .I], 10)]
##                         doc_id positive negative
##  1:     Peter_Beyer16037862001        4        2
##  2: JuergenBraunAfD16070286001        2        4
##  3:   BriHasselmann15932577001        2        3
##  4:   agnieszka_mdb15510815401        3        3
##  5:     jankortemdb15388216201        1        4
##  6:        kleikert15421870801        5        0
##  7:      KaiGehring16099352401        2        2
##  8: ManuelHoeferlin15543111601        5        5
##  9:     UliGroetsch15916122601        2        0
## 10:  HoffmannForest15877429801        3        2
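
Keep in mind that these are raw counts, so longer tweets will tend to have higher values in both columns. If that is a concern for the downstream classifier, the counts can be related to document length; a small optional sketch (the object names are ours):

# Optional: express polarity counts relative to document length
doc_lengths <- data.table::data.table(
  doc_id = quanteda::docnames(twitter_dfm),
  n_tokens = as.integer(quanteda::ntoken(twitter_dfm)))

polarity_shares <- twitter_polarities[doc_lengths, on = "doc_id"]
polarity_shares[, `:=`(
  positive_share = positive / n_tokens,
  negative_share = negative / n_tokens)]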

Get number of emojis used

The next feature we are going to extract is the number of emojis (there are also online sources for associating them with polarities).

Good thing we did the emoji extraction before:

# For this, we cannot operate on the texts but need to get the docvars where all non-text variables are stored

twitter_docvars <- data.table::as.data.table(
  cbind(
    doc_id = quanteda::docid(twitter_corpus),
    quanteda::docvars(twitter_corpus)), 
  key = "doc_id")

twitter_docvars[sample(twitter_docvars[, .I], 10)]
##                         doc_id       last_name    first_name
##  1:        MiRo_SPD15821874601 Roth (Heringen)       Michael
##  2:     UweSchummer15962863201        Schummer           Uwe
##  3:         oezoguz15954811201          Özoguz         Aydan
##  4:       JM_Luczak15693072001          Luczak Dr. Jan-Marco
##  5: Beatrix_vStorch15415882801          Storch   Beatrix von
##  6: Beatrix_vStorch15507566401          Storch   Beatrix von
##  7:      helgelindh15809889601           Lindh         Helge
##  8:        DJanecek16091836201         Janecek        Dieter
##  9:     Peter_Beyer16037862001           Beyer         Peter
## 10:        LisaPaus15936849601            Paus          Lisa
##                         wahlkreis_name   party          bundesland
##  1: Werra-Meißner – Hersfeld-Rotenburg     spd              Hessen
##  2:                            Viersen cdu_csu Nordrhein-Westfalen
##  3:                   Hamburg-Wandsbek     spd             Hamburg
##  4:        Berlin-Tempelhof-Schöneberg cdu_csu              Berlin
##  5:                       Berlin-Mitte     afd              Berlin
##  6:                       Berlin-Mitte     afd              Berlin
##  7:                        Wuppertal I     spd Nordrhein-Westfalen
##  8:                 München-West/Mitte  gruene              Bayern
##  9:                        Mettmann II cdu_csu Nordrhein-Westfalen
## 10:  Berlin-Charlottenburg-Wilmersdorf  gruene              Berlin
##     unemployment_rate share_pop_migration        username followers_count
##  1:               5.2                14.0        MiRo_SPD           35028
##  2:               6.7                16.8     UweSchummer            8549
##  3:               7.1                28.3         oezoguz           19092
##  4:               9.4                24.1       JM_Luczak            5202
##  5:               9.4                24.1 Beatrix_vStorch           61329
##  6:               9.4                24.1 Beatrix_vStorch           61329
##  7:               9.6                31.8      helgelindh            9850
##  8:               4.5                34.3        DJanecek           11488
##  9:               6.4                24.4     Peter_Beyer            1799
## 10:               9.4                24.1        LisaPaus            9024
##              created_at favorite_count retweet_count    label       emojis
##  1: 2020-02-20 08:31:00            915           181 negative             
##  2: 2020-08-01 12:52:00              2             0 negative             
##  3: 2020-07-23 05:12:00              9             0 negative             
##  4: 2019-09-24 06:40:00             23             3 negative             
##  5: 2018-11-07 10:58:00            377           141 negative <U+0001F92C>
##  6: 2019-02-21 13:44:00           1228           426 negative <U+0001F929>
##  7: 2020-02-06 11:36:00             67            15 negative             
##  8: 2020-12-28 19:27:00             15             2 positive             
##  9: 2020-10-27 08:10:00              1             0 positive             
## 10: 2020-07-02 10:16:00             12             5 negative <U+0001F92F>
##                                           hashtags
##  1:                                           #AfD
##  2:                                               
##  3:                                               
##  4:                     #Mietendeckel,#Mietspiegel
##  5:                         #MidtermsElections2018
##  6:                                           #SPD
##  7:    #Kubicki,#Thueringen,#Fundamentalopposition
##  8:                                               
##  9:                        #Mond,#USA,#Deutschland
## 10: #Dax,#Geldwaesche,#Wirecard,#Aufsicht,#Skandal
##                                   tags
##  1:                                   
##  2:         @oida_grantler,@HollsteinM
##  3:                                   
##  4: @cducsubt<U+2069>,@BFWBund<U+2069>
##  5:                                   
##  6:                                   
##  7:                                   
##  8:                    @FAZ_Wirtschaft
##  9:      @NASA,@DLR_de,@SOFIAtelescope
## 10:                     @MarcusTheurer
# Count emojis (stored as list objects, so we take the length of these lists)

twitter_emojis <- twitter_docvars[
  , .(doc_id, emojis)
  ][, n_emojis := lengths(emojis)
    ][, emojis := NULL]

list(
  twitter_emojis = twitter_emojis[sample(twitter_emojis[, .I], 10)],
  distribution = table(twitter_emojis$n_emojis))
## $twitter_emojis
##                         doc_id n_emojis
##  1:   BriHasselmann15481791601        0
##  2:  DietmarBartsch15357022801        0
##  3: Beatrix_vStorch15862634401        0
##  4:        DJanecek16016214001        0
##  5: karl_lauterbach15824776201        0
##  6:   renatekuenast16019082001        0
##  7:   EnricoKomning15133498801        0
##  8:           anked15925735801        0
##  9:    MaikBeermann15216265201        0
## 10:   niemamovassat15904086001        0
## 
## $distribution
## 
##    0    1    2    3    4    5    6    7 
## 1121   65   18    7    1    1    1    1

Detect negation

Negation handling under the bag-of-words assumption is difficult. There are more sophisticated approaches (one is sketched below), but for this demo we will simply record whether or not an expression of negation is present.
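
One slightly more sophisticated variant, shown here only as a sketch (it is not used in the remainder of this demo), is to fuse a negator with the token that immediately follows it, so that, e.g., "nicht gut" becomes the single token "nicht_gut" and can carry its own polarity:

# Sketch: compound "nicht" with the immediately following token
# (glob pattern via phrase())
twitter_tokens_neg <- quanteda::tokens_compound(
  twitter_tokens,
  pattern = quanteda::phrase("nicht *"),
  concatenator = "_")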

First, define negation indicators (and don’t forget to clean):

tokens_negation <- clean_and_stem(c(
    "nicht", 
    "nie", 
    "niemals", 
    "nein", 
    "niemand", 
    "nix", 
    "nirgends", 
    "kein"))

Then we again use a dictionary to count, for each document, how often any of these tokens occurs:

# Create dictionary

dictionary_negation <- quanteda::dictionary(list(negation = tokens_negation))

# Match tokens

twitter_negation <- quanteda::dfm_lookup(twitter_dfm, dictionary_negation)

# Convert resulting dfm object to data.table (via data.frame as quanteda does not yet support conversion to data.table)

twitter_negation <- quanteda::convert(twitter_negation, to = "data.frame")

twitter_negation <- data.table::as.data.table(
  twitter_negation,
  key = "doc_id")

# Inspect

twitter_negation[sample(twitter_negation[, .I], 10)]
##                         doc_id negation
##  1:       ABaerbock15256297201        0
##  2: HajdukBundestag15518854801        0
##  3:      KaiGehring15833458801        1
##  4: Beatrix_vStorch15379738201        1
##  5:  GoeringEckardt15529879801        0
##  6:       JM_Luczak15608635201        0
##  7: karl_lauterbach15721146001        0
##  8:  KonstantinNotz15884889001        0
##  9: Dr_Rainer_Kraft15910718401        0
## 10:   Frank_Magnitz15235207201        0
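
Strictly speaking, dfm_lookup returns counts rather than a yes/no indicator. Since we only set out to record whether a negation is present at all, you could optionally binarize the column:

# Optional: reduce the negation count to a presence/absence flag
twitter_negation[, negation := as.integer(negation > 0)]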

Get character unigrams

Due to the sheer variety of words that exist in a language, a simple tokens representation will typically contain a huge number of features:

quanteda::nfeat(twitter_dfm)
## [1] 7583

This is an overwhelmingly large number for any machine learning classifier (particularly when, as here, there are more features than observations).
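
One common remedy, not pursued in this demo (we take a different route below), is to trim features that occur in only a handful of documents; a sketch with an arbitrary threshold:

# Sketch: keep only features that appear in at least 5 documents
twitter_dfm_trimmed <- quanteda::dfm_trim(twitter_dfm, min_docfreq = 5)
quanteda::nfeat(twitter_dfm_trimmed)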

We will therefore streamline our representation further. Most applications include some kind of n-grams; interestingly, character unigrams seem to work quite well, i.e., rather than entire words (or concatenations of words), we simply count occurrences of single letters.

This effectively reduces each text to a set of merely 26 features, one per letter of the alphabet:

# Create a new tokens object, this time consisting of single characters

twitter_tokens_char <- quanteda::tokens(
  twitter_corpus,
  what = "character",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_separators = TRUE,
  split_hyphens = TRUE) 

# Convert to dfm

twitter_char_unigrams <-  quanteda::dfm(twitter_tokens_char)

# Convert resulting dfm object to data.table (via data.frame as quanteda does not yet support conversion to data.table)

twitter_char_unigrams <- quanteda::convert(twitter_char_unigrams, to = "data.frame")

twitter_char_unigrams <- data.table::as.data.table(
  twitter_char_unigrams,
  key = "doc_id")

data.table::setcolorder(twitter_char_unigrams, c("doc_id", letters))

# Inspect

twitter_char_unigrams[sample(twitter_char_unigrams[, .I], 10)]
##                         doc_id  a b c  d  e  f  g  h  i j k  l m  n o p q  r  s
##  1: Beatrix_vStorch15862634401 18 5 5 13 37  6  5 11 11 0 3  6 3 14 7 6 0 22 11
##  2:   FrStraetmanns15923979601  8 6 8 12 30  4  6  9 18 0 3  5 3 14 5 1 0 12 10
##  3:   MatthiasHauer15918112801 11 0 3  6 21  1  7  5 11 0 2  5 0 14 5 3 0 11 13
##  4:          katdro15869262601 15 2 9 13 38  5  7 11 17 0 1  7 4 22 5 2 0 15 18
##  5:       jensspahn15809874001 19 2 5 12 43  3 10  8 19 1 5  8 5 21 6 1 0 18 15
##  6:        HuberMdB15830672401  8 3 9 11 35  3  8 11 11 0 4  4 1 16 7 1 0 17 12
##  7:   DrAndreasNick16057137001 16 2 4  8 23  5  1  9 13 2 4 10 6 19 7 3 0 13 14
##  8:      falkomohrs15414163201 14 1 6  7 36  3  4 11 16 0 4 15 5 11 7 4 0 18 25
##  9:    HeikeHaensel15236835601  9 3 4 10 40 11  7  7 18 0 3  3 3 21 8 1 0 25 15
## 10:   MetinHakverdi15736454401 11 2 7  5 32  4 10 12 17 0 3  8 5 19 7 7 0 13 11
##      t  u v w x y z
##  1: 11 10 2 2 0 0 2
##  2:  6 10 1 0 0 0 0
##  3:  5  9 3 2 0 0 3
##  4: 16 12 1 4 2 0 2
##  5: 14  8 3 2 0 0 5
##  6: 14 12 2 5 0 0 3
##  7:  8 10 2 1 0 0 1
##  8: 16 12 3 2 0 0 4
##  9: 17 17 4 5 0 1 3
## 10: 15 11 0 1 0 1 4

Get POS tags

The last set of features we are going to extract are part-of-speech (POS) tags. Note that tagging must take place on the original documents, not on a tokens representation that is agnostic to grammatical structure.

We use the spacyr package, a wrapper around Python’s spaCy library that is nicely integrated with quanteda (well, they share the same author after all). This vignette provides a quick intro.

Note that setting this up can be a serious pain – getting R and Python to work hand in hand on your local machine is hard sometimes. When in doubt, resort to some clean environment such as Google Colab.

# Install spacyr and get the relevant model for German language

spacyr::spacy_install()
spacyr::spacy_download_langmodel("de")

# Run POS tagger

spacyr::spacy_initialize(model = "de_core_news_sm")
  
twitter_pos_tags <- data.table::as.data.table(
  spacyr::spacy_parse(
    twitter_corpus,
    lemma = FALSE, # do not include lemmatized tokens
    entity = FALSE), # do not include named entities
  key = "doc_id")
head(twitter_pos_tags, 10)
##                   doc_id sentence_id token_id                token   pos
##  1: ABaerbock15156546001           1        1                 Habe   AUX
##  2: ABaerbock15156546001           1        2                  mir  PRON
##  3: ABaerbock15156546001           1        3                  das   DET
##  4: ABaerbock15156546001           1        4                  Gro PROPN
##  5: ABaerbock15156546001           1        5 Ko-Sondierungspapier  NOUN
##  6: ABaerbock15156546001           1        6                   zu   ADP
##  7: ABaerbock15156546001           1        7                Klima  NOUN
##  8: ABaerbock15156546001           1        8              nochmal   ADV
##  9: ABaerbock15156546001           1        9                genau   ADJ
## 10: ABaerbock15156546001           1       10           angeschaut  VERB

Looking good already! For the POS tags to be usable as features in sentiment analysis, however, we need to convert them from long to wide format:

# Count occurrences per document and POS tag

twitter_pos_tags <- twitter_pos_tags[
  , .(doc_id, pos)
  ][, aux := 1
    ][, n_tags := sum(aux), by = list(doc_id, pos)
      ][, aux := NULL]

head(twitter_pos_tags)
##                  doc_id   pos n_tags
## 1: ABaerbock15156546001   AUX      2
## 2: ABaerbock15156546001  PRON      2
## 3: ABaerbock15156546001   DET      4
## 4: ABaerbock15156546001 PROPN      2
## 5: ABaerbock15156546001  NOUN      8
## 6: ABaerbock15156546001   ADP      1
# Convert to wide format

twitter_pos_tags <- data.table::dcast(
  unique(twitter_pos_tags),
  doc_id ~ pos,
  value.var = "n_tags",
  fun.aggregate = sum)

head(twitter_pos_tags)
##                  doc_id ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN
## 1: ABaerbock15156546001   4   1   3   2     0   4    0    8   0    1    2     2
## 2: ABaerbock15210084001   5   0   5   1     1   1    0    2   1    1    1     4
## 3: ABaerbock15216341401   4   4   2   1     0   6    0   10   2    2    3     0
## 4: ABaerbock15252349801   5   4   3   1     2   4    0    9   0    1    5     0
## 5: ABaerbock15256297201   2   4   1   5     3   2    0   10   0    0    5     2
## 6: ABaerbock15283779001   2   4   2   2     0   4    0    8   0    2    5     3
##    PUNCT SCONJ VERB X
## 1:     7     0    3 2
## 2:     5     1    3 1
## 3:     7     0    4 0
## 4:     6     0    3 0
## 5:     6     1    4 0
## 6:     6     0    9 0
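
As an aside, the same wide table can be obtained in a single step from the parsed long-format output (called twitter_pos_parsed here as a stand-in name for the spacy_parse result before the aggregation above):

# One-step alternative (sketch): count parsed tokens per doc_id/pos pair while casting
twitter_pos_wide <- data.table::dcast(
  twitter_pos_parsed, # spacy_parse output as data.table (stand-in name)
  doc_id ~ pos,
  value.var = "token",
  fun.aggregate = length)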

Collect all static features

In a final step, we put everything together (starting the join chain from a table of all document IDs and their labels so that no documents are lost):

# Collect all features

twitter_features_static <- twitter_docvars[, .(doc_id, label)
  ][twitter_pos_tags,
    ][twitter_negation,
      ][twitter_char_unigrams,
        ][twitter_emojis,
          ][twitter_polarities, ]

saveRDS(twitter_features_static, sprintf("%s/twitter_features_static.RDS", path))

twitter_features_static[sample(twitter_features_static[, .I], 10)]
##                         doc_id    label ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM
##  1:     b_riexinger15374616001 negative   4   3   4   0     2   6    0    8   0
##  2:      Kaiser_SPD15533487601 positive   5   4   1   1     0   3    0    9   0
##  3:   Schneider_AfD16028714401 positive   3   1   2   2     1   6    0    7   0
##  4:     PaulZiemiak15744411601 positive   4   4   3   3     0   2    0    5   0
##  5: karl_lauterbach15812601001 negative   0   3   4   1     3   1    0    9   0
##  6:   hubertus_heil15744274201 positive   5   2   1   1     2   2    0    8   0
##  7:        HuberMdB15468630001 positive   3   4   0   0     0   5    0    7   1
##  8:      KaiGehring15742380601 positive   4   4   1   4     0   0    0   10   1
##  9:    juergenhardt15442605601 positive   5   3   1   0     2   0    0    5   0
## 10:       SoenkeRix15547031401 positive   2   0   2   1     0   0    0    2   0
##     PART PRON PROPN PUNCT SCONJ VERB X negation  a b  c  d  e f g  h  i j k  l
##  1:    0    0     4     3     0    6 0        0 10 3  3 12 38 7 4 10 19 0 3  8
##  2:    0    0     4     4     0    2 0        0 15 1  2  8 29 7 9  3 13 0 5  6
##  3:    0    1     4     3     0    2 0        1 10 7  4  7 24 0 2  6 14 0 5  8
##  4:    0   10     1     6     0    6 0        0 15 5 10  6 35 3 6 13 19 0 2  8
##  5:    1    6     1     6     1    9 0        1 16 2  3  6 39 3 9  7 13 1 1 13
##  6:    0    2     0     4     0    4 0        0 11 6  4  7 27 2 3  8 11 1 4  3
##  7:    0    0     0     3     0    1 0        0  5 0  2  6 17 3 7  4 12 0 2  1
##  8:    1    2     1     8     0    4 0        0 11 2  8  5 31 5 2 12 18 1 5  4
##  9:    0    0     3     3     0    1 0        0 11 1  6  7 18 3 2  7  7 1 2  2
## 10:    1    2     0     2     1    1 0        1  3 0  2  3 10 3 2  3  8 1 0  2
##      m  n o p q  r  s  t  u v w x y z n_emojis positive negative
##  1:  9 20 8 5 0 12  7  9  6 0 4 0 0 3        0        4        4
##  2:  8 19 9 3 0 14 12 16 10 3 2 0 0 1        1        4        2
##  3:  4 13 9 1 0 11  7 10  5 1 3 0 0 3        0        3        1
##  4: 13 14 2 1 0  8 10 12  8 0 6 0 0 3        0        7        0
##  5:  8 13 4 3 0 20 17 21  6 0 5 0 0 4        0        7        3
##  6:  3 16 5 3 0  6  9 17  5 1 1 0 0 6        0        9        2
##  7:  7 14 6 2 0  5  2  8  4 0 1 0 0 0        0        2        0
##  8:  6 17 0 1 0 16 12 13  9 0 5 0 0 3        0        7        4
##  9:  3 14 3 3 0 12 12 13 10 2 0 0 0 2        0        1        0
## 10:  1  8 0 2 0  5  4  4  3 1 1 0 0 0        0        3        1
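
Before moving on, a quick optional sanity check that the join chain really kept every document and did not introduce missing values:

# Optional sanity check: all documents retained, no NAs produced by the joins
stopifnot(nrow(twitter_features_static) == quanteda::ndoc(twitter_corpus))
colSums(is.na(twitter_features_static))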

This provides us with a solid foundation for sentiment analysis! We now represent each document by a total of 47 variables. Of course, we can think of many more features to include here; we will see some more in the exercise. In general, be creative and see how classification can be improved by adding different explanatory variables.