This demo is about static feature extraction. We show how to extract several features deemed useful for sentiment analysis. "Static" means these features are computed independently for each document (unlike, for instance, word embeddings, which are learned from the corpus as a whole).
# Load required packages
library(data.table)
library(quanteda)
library(stringi)
library(stringr)
We will extract a set of static features that have been shown to work well for sentiment analysis in general, including the number of positive and negative polarity clues, emoji counts, negation indicators, and character unigram frequencies.
tokens object
The following analyses all revolve around the presence/absence or number of specific text tokens across documents. Therefore, we convert our corpus object from before to a tokens object, where documents are represented by single tokens rather than a fluent text, and along the way omit casing.
# path <- ... (individual file location)
twitter_corpus <- readRDS(sprintf("%s/twitter_corpus.RDS", path))
twitter_tokens <- quanteda::tokens(
twitter_corpus,
what = "word", # take tokens to be words
remove_symbols = TRUE, # in case we forgot anything earlier
remove_numbers = TRUE, # not relevant
remove_separators = TRUE, # not relevant
remove_punct = TRUE, # not relevant
split_hyphens = TRUE, # sometimes useful with German language
include_docvars = TRUE) # keep additional variables
twitter_tokens <- quanteda::tokens_tolower(twitter_tokens)
While we are at it, we perform stemming on our tokens to increase their congruency across documents (remember, we want our documents to be represented by as many common tokens as possible). quanteda has built-in functionality for this that in turn calls SnowballC's wordStem function:
twitter_tokens <- quanteda::tokens_wordstem(
twitter_tokens,
language = "german")
twitter_tokens
## Tokens consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
## [1] "hab" "mir" "das" "gro"
## [5] "ko" "sondierungspapi" "zu" "klima"
## [9] "nochmal" "genau" "angeschaut" "krass"
## [ ... and 22 more ]
##
## ABaerbock15210084001 :
## [1] "auch" "weltweit" "sieht" "man" "dass" "angela"
## [7] "merkel" "s" "groko" "nicht" "klimataug" "ist"
## [ ... and 13 more ]
##
## ABaerbock15216341401 :
## [1] "der" "groesst" "global"
## [4] "herausforder" "der" "klimakris"
## [7] "sek" "von" "std"
## [10] "regierungserklaer" "zu" "widm"
## [ ... and 24 more ]
##
## ABaerbock15252349801 :
## [1] "wir" "brauch" "ein" "and"
## [5] "verkehrspolit" "weg" "von" "benzin"
## [9] "und" "diesel" "hin" "zu"
## [ ... and 25 more ]
##
## ABaerbock15256297201 :
## [1] "das" "ist" "die" "leistung" "von"
## [6] "hundert" "wahlkaempf" "und" "polit" "im"
## [11] "land" "die"
## [ ... and 27 more ]
##
## ABaerbock15283779001 :
## [1] "luis" "amtsberg" "hat" "heut"
## [5] "all" "gesagt" "was" "es"
## [9] "zum" "familiennachzug" "zu" "sag"
## [ ... and 28 more ]
##
## [ reached max_ndoc ... 1,209 more documents ]
We quickly spot some tokens that do not seem to be of much help (e.g., "zu"). This is where stopwords come in. We define a list of such stopwords and have quanteda remove them from our tokens object.
However, we need to account for the fact that we have done quite a bit of text manipulation so far – in particular, we have removed umlauts from our tokens and reduced them to their word stem. In order to make sure our stopwords are matched correctly, we should give them the same treatment.
In fact, we will do this cleaning-stemming transformation more than once, so we encapsulate it in a small function we can call again later:
clean_and_stem <- function(text) {
# Convert to uniform encoding
text <- stringi::stri_trans_general(text, "Any-Latin")
# Replace umlauts and the sharp s (ß)
text <- stringr::str_replace_all(
text,
c("\u00c4" = "Ae",
"\u00e4" = "ae",
"\u00d6" = "Oe",
"\u00f6" = "oe",
"\u00dc" = "Ue",
"\u00fc" = "ue",
"\u00df" = "ss"))
# Stem (note: wordStem() defaults to the English Porter rules; pass
# language = "german" for the German Snowball stemmer)
text <- SnowballC::wordStem(text)
text
}
clean_and_stem(c("Döner", "Bereicherung", "kulinarischer", "Angebote"))
## [1] "Doener" "Bereicherung" "kulinarisch" "Angebot"
Seems to do the job (but also note the limitations of stemming, one of which we flag below).
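One limitation we should flag ourselves: called without a language argument, SnowballC::wordStem() applies the English Porter stemmer, which misses German suffixes such as "-ung" (this is why "Bereicherung" came through unchanged above, whereas the German Snowball stemmer would truncate it):
# Default Porter rules vs. the German Snowball stemmer
SnowballC::wordStem("Bereicherung")                      # unchanged
SnowballC::wordStem("Bereicherung", language = "german") # stems to "Bereicher"
This also means that some cleaned stopwords and polarity clues below may not exactly match our German-stemmed tokens, something worth addressing in a real application.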
We create a clean stopwords list by running quanteda's built-in German stopwords through this function:
# Create stopwords list
stopwords <- clean_and_stem(quanteda::stopwords(language = "de"))
# Remove potential duplicates and empty instances induced by stemming
stopwords <- unique(stopwords)
stopwords <- stopwords[nchar(stopwords) > 0]
# Inspect
sort(stopwords)
## [1] "aber" "al" "all" "allem" "allen" "aller"
## [7] "also" "am" "an" "ander" "anderem" "anderen"
## [13] "anderm" "andern" "anderr" "au" "auch" "auf"
## [19] "bei" "bi" "bin" "bist" "da" "damit"
## [25] "dann" "dass" "dasselb" "dazu" "de" "dein"
## [31] "deinem" "deinen" "deiner" "dem" "demselben" "den"
## [37] "denn" "denselben" "der" "derer" "derselb" "derselben"
## [43] "desselben" "dessen" "di" "dich" "die" "dies"
## [49] "dieselb" "dieselben" "diesem" "diesen" "dieser" "dir"
## [55] "doch" "dort" "du" "durch" "e" "ein"
## [61] "einem" "einen" "einer" "einig" "einigem" "einigen"
## [67] "einmal" "er" "etwa" "euch" "euer" "eur"
## [73] "eurem" "euren" "eurer" "fuer" "gegen" "gewesen"
## [79] "hab" "habe" "haben" "hat" "hatt" "hatten"
## [85] "hier" "hin" "hinter" "ich" "ihm" "ihn"
## [91] "ihnen" "ihr" "ihrem" "ihren" "ihrer" "im"
## [97] "in" "indem" "ist" "jede" "jedem" "jeden"
## [103] "jeder" "jene" "jenem" "jenen" "jener" "jetzt"
## [109] "kann" "kein" "keinem" "keinen" "keiner" "koennen"
## [115] "koennt" "machen" "man" "manch" "manchem" "manchen"
## [121] "mancher" "mein" "meinem" "meinen" "meiner" "mich"
## [127] "mir" "mit" "muss" "musst" "nach" "nicht"
## [133] "noch" "nun" "nur" "ob" "oder" "ohn"
## [139] "sehr" "sein" "seinem" "seinen" "seiner" "selbst"
## [145] "sich" "sie" "sind" "so" "solch" "solchem"
## [151] "solchen" "solcher" "soll" "sollt" "sondern" "sonst"
## [157] "ueber" "um" "un" "und" "uns" "unsem"
## [163] "unsen" "unser" "unter" "viel" "vom" "von"
## [169] "vor" "wa" "waehrend" "war" "waren" "warst"
## [175] "weg" "weil" "weiter" "welch" "welchem" "welchen"
## [181] "welcher" "wenn" "werd" "werden" "wie" "wieder"
## [187] "will" "wir" "wird" "wirst" "wo" "wollen"
## [193] "wollt" "wuerd" "wuerden" "zu" "zum" "zur"
## [199] "zwar" "zwischen"
Words indicating negation seem too important to discard for the sentiment analysis task. We cross them off the stopwords list and remove the remaining stopwords from our tokens:
# Deselect negation words
stopwords <- stopwords[
  !stringr::str_detect(stopwords, "^(kein|nicht)")]
# Remove from tokens
twitter_tokens <- quanteda::tokens_remove(twitter_tokens, stopwords)
# Inspect
twitter_tokens
## Tokens consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
## [1] "das" "gro" "ko" "sondierungspapi"
## [5] "klima" "nochmal" "genau" "angeschaut"
## [9] "krass" "facto" "sogar" "das"
## [ ... and 13 more ]
##
## ABaerbock15210084001 :
## [1] "weltweit" "sieht" "angela" "merkel" "s" "groko"
## [7] "nicht" "klimataug" "sond" "vielmehr" "das" "paris"
## [ ... and 7 more ]
##
## ABaerbock15216341401 :
## [1] "groesst" "global" "herausforder"
## [4] "klimakris" "sek" "std"
## [7] "regierungserklaer" "widm" "unterstreicht"
## [10] "groesst" "leerstell" "koalitionsvertrag"
## [ ... and 10 more ]
##
## ABaerbock15252349801 :
## [1] "brauch" "and" "verkehrspolit" "benzin"
## [5] "diesel" "emissionsarm" "mobilitaet" "das"
## [9] "sozial" "frag" "unt" "schlecht"
## [ ... and 8 more ]
##
## ABaerbock15256297201 :
## [1] "das" "leistung" "hundert" "wahlkaempf" "polit"
## [6] "land" "drauss" "gezeigt" "dicht" "was"
## [11] "mensch" "umtreibt"
## [ ... and 8 more ]
##
## ABaerbock15283779001 :
## [1] "luis" "amtsberg" "heut" "gesagt"
## [5] "was" "es" "familiennachzug" "sag"
## [9] "gibt" "wer" "integration" "gefluechtet"
## [ ... and 14 more ]
##
## [ reached max_ndoc ... 1,209 more documents ]
Apparently, our cleaning operations leave behind some very short word fragments, which we discard, along with a handful of frequent short words that slipped through the stopword filter:
twitter_tokens <- quanteda::tokens_select(twitter_tokens, min_nchar = 3)
twitter_tokens <- quanteda::tokens_remove(
twitter_tokens,
c("der", "die", "das", "was", "wer", "wie", "ich", "sie", "wir", "ihr"))
twitter_tokens
## Tokens consisting of 1,215 documents and 16 docvars.
## ABaerbock15156546001 :
## [1] "gro" "sondierungspapi" "klima" "nochmal"
## [5] "genau" "angeschaut" "krass" "facto"
## [9] "sogar" "kyoto" "protokoll" "meilenstein"
## [ ... and 9 more ]
##
## ABaerbock15210084001 :
## [1] "weltweit" "sieht" "angela" "merkel" "groko"
## [6] "nicht" "klimataug" "sond" "vielmehr" "paris"
## [11] "abkomm" "unterlaeuft"
## [ ... and 5 more ]
##
## ABaerbock15216341401 :
## [1] "groesst" "global" "herausforder"
## [4] "klimakris" "sek" "std"
## [7] "regierungserklaer" "widm" "unterstreicht"
## [10] "groesst" "leerstell" "koalitionsvertrag"
## [ ... and 9 more ]
##
## ABaerbock15252349801 :
## [1] "brauch" "and" "verkehrspolit" "benzin"
## [5] "diesel" "emissionsarm" "mobilitaet" "sozial"
## [9] "frag" "unt" "schlecht" "luft"
## [ ... and 7 more ]
##
## ABaerbock15256297201 :
## [1] "leistung" "hundert" "wahlkaempf" "polit" "land"
## [6] "drauss" "gezeigt" "dicht" "mensch" "umtreibt"
## [11] "konkret" "beweis"
## [ ... and 6 more ]
##
## ABaerbock15283779001 :
## [1] "luis" "amtsberg" "heut" "gesagt"
## [5] "familiennachzug" "sag" "gibt" "integration"
## [9] "gefluechtet" "perspektiv" "zusammenleb" "famili"
## [ ... and 9 more ]
##
## [ reached max_ndoc ... 1,209 more documents ]
There is still a variety of tokens we could probably remove (more extensive stopword lists can be found online; the built-in one from quanteda is fairly limited, see the sketch below). In a real application you should dedicate some time to this sub-task to make sure your tokens representation is not contaminated by such noise.
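For instance, the stopwords package (which quanteda's stopwords() builds on) ships the considerably larger "stopwords-iso" collection; a quick sketch, assuming the package is installed:
# Larger stopword list from the stopwords-iso source; negation words would
# again need to be deselected before removal from the tokens
stopwords_iso <- clean_and_stem(
  stopwords::stopwords(language = "de", source = "stopwords-iso"))
stopwords_iso <- unique(stopwords_iso[nchar(stopwords_iso) > 0])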
dfm object
Now that we have characterized our documents by tokens, we want to put some numbers on them: how often does each token occur in each document?
This is where the document-feature matrix (dfm) comes in:
twitter_dfm <- quanteda::dfm(twitter_tokens)
twitter_dfm
## Document-feature matrix of: 1,215 documents, 7,583 features (99.8% sparse) and 16 docvars.
## features
## docs gro sondierungspapi klima nochmal genau angeschaut krass
## ABaerbock15156546001 1 1 1 1 1 1 1
## ABaerbock15210084001 0 0 0 0 0 0 0
## ABaerbock15216341401 0 0 0 0 0 0 0
## ABaerbock15252349801 0 0 0 0 0 0 0
## ABaerbock15256297201 0 0 0 0 0 0 0
## ABaerbock15283779001 0 0 0 0 1 0 0
## features
## docs facto sogar kyoto
## ABaerbock15156546001 1 1 1
## ABaerbock15210084001 0 0 0
## ABaerbock15216341401 0 0 0
## ABaerbock15252349801 0 0 0
## ABaerbock15256297201 0 0 0
## ABaerbock15283779001 0 0 0
## [ reached max_ndoc ... 1,209 more documents, reached max_nfeat ... 7,573 more features ]
saveRDS(twitter_dfm, sprintf("%s/twitter_dfm.RDS", path))
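As an aside, raw counts are not the only way to fill a dfm: quanteda can also weight the entries, for example via tf-idf, which down-weights tokens that occur in many documents. A minimal sketch, not used in what follows:
# tf-idf weighting of the document-feature matrix (illustration only)
twitter_dfm_tfidf <- quanteda::dfm_tfidf(twitter_dfm)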
Time for some feature extraction at last.
The first static feature we will extract is the number of (positive or negative) polarity clues in the data.
Polarity clues are sentiment-bearing words like "grauenhaft" ("dreadful"). There is a variety of open-source lists for these; we will use the Global Polarity Clues collection available here (already somewhat processed to only include the relevant columns).
# Get polarities
list_gpc <- readRDS(sprintf("%s/global_polarity_clues.RDS", path))
# Clean
list_gpc <- lapply(list_gpc, clean_and_stem)
# Inspect
str(list_gpc)
## List of 2
## $ positive: chr [1:17627] "Abfangschirm" "Abgeklaertheit" "Abgeschlossenheit" "Abgleich" ...
## $ negative: chr [1:19962] "Abandon" "Abart" "Abbau" "Abbaue" ...
quanteda offers dedicated dictionary objects that allow for convenient look-ups in dfm objects, so we convert the GPC to this format (note that this operation also takes care of lowercasing):
dictionary_gpc <- quanteda::dictionary(
list(
positive = list_gpc$positive,
negative = list_gpc$negative))
dictionary_gpc
## Dictionary object with 2 key entries.
## - [positive]:
## - abfangschirm, abgeklaertheit, abgeschlossenheit, abgleich, abgott, abhilf, abhilfemassnahm, abkomm, abkommen, abkuehlung, ableger, abloesen, abloesung, abmachung, abmachungen, abschirmung, abschluss, abschlusss, abschluess, abschluessen [ ... and 11,514 more ]
## - [negative]:
## - abandon, abart, abbau, abbaue, abbauen, abbildungsfehl, abbildverzerrung, abblendung, abbohrung, abbruch, abbruchstel, abbruech, abbruechen, abdampf, abdankung, abdankungen, abdaempfung, abdaempfungen, abenteuerlichkeit, abenteuerpolitik [ ... and 14,061 more ]
Perform the look-up:
# Look up polarities
twitter_polarities <- quanteda::dfm_lookup(twitter_dfm, dictionary_gpc)
# Convert resulting dfm object to data.table (via data.frame as quanteda does not yet support conversion to data.table)
twitter_polarities <- quanteda::convert(twitter_polarities, to = "data.frame")
twitter_polarities <- data.table::as.data.table(
twitter_polarities,
key = "doc_id")
# Inspect
twitter_polarities[sample(twitter_polarities[, .I], 10)]
## doc_id positive negative
## 1: Peter_Beyer16037862001 4 2
## 2: JuergenBraunAfD16070286001 2 4
## 3: BriHasselmann15932577001 2 3
## 4: agnieszka_mdb15510815401 3 3
## 5: jankortemdb15388216201 1 4
## 6: kleikert15421870801 5 0
## 7: KaiGehring16099352401 2 2
## 8: ManuelHoeferlin15543111601 5 5
## 9: UliGroetsch15916122601 2 0
## 10: HoffmannForest15877429801 3 2
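On top of the raw counts, one could derive additional features; a sketch of ours (not part of the pipeline below) computing a net polarity score:
# Net polarity: positive minus negative clue count (illustration only)
head(twitter_polarities[, .(doc_id, net_polarity = positive - negative)], 3)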
The next feature we are going to extract is the number of emojis (there are also online sources that associate emojis with polarities). Good thing we did the emoji extraction earlier:
# For this, we cannot operate on the texts but need to get the docvars where all non-text variables are stored
twitter_docvars <- data.table::as.data.table(
cbind(
doc_id = quanteda::docid(twitter_corpus),
quanteda::docvars(twitter_corpus)),
key = "doc_id")
twitter_docvars[sample(twitter_docvars[, .I], 10)]
## doc_id last_name first_name
## 1: MiRo_SPD15821874601 Roth (Heringen) Michael
## 2: UweSchummer15962863201 Schummer Uwe
## 3: oezoguz15954811201 Özoguz Aydan
## 4: JM_Luczak15693072001 Luczak Dr. Jan-Marco
## 5: Beatrix_vStorch15415882801 Storch Beatrix von
## 6: Beatrix_vStorch15507566401 Storch Beatrix von
## 7: helgelindh15809889601 Lindh Helge
## 8: DJanecek16091836201 Janecek Dieter
## 9: Peter_Beyer16037862001 Beyer Peter
## 10: LisaPaus15936849601 Paus Lisa
## wahlkreis_name party bundesland
## 1: Werra-Meißner – Hersfeld-Rotenburg spd Hessen
## 2: Viersen cdu_csu Nordrhein-Westfalen
## 3: Hamburg-Wandsbek spd Hamburg
## 4: Berlin-Tempelhof-Schöneberg cdu_csu Berlin
## 5: Berlin-Mitte afd Berlin
## 6: Berlin-Mitte afd Berlin
## 7: Wuppertal I spd Nordrhein-Westfalen
## 8: München-West/Mitte gruene Bayern
## 9: Mettmann II cdu_csu Nordrhein-Westfalen
## 10: Berlin-Charlottenburg-Wilmersdorf gruene Berlin
## unemployment_rate share_pop_migration username followers_count
## 1: 5.2 14.0 MiRo_SPD 35028
## 2: 6.7 16.8 UweSchummer 8549
## 3: 7.1 28.3 oezoguz 19092
## 4: 9.4 24.1 JM_Luczak 5202
## 5: 9.4 24.1 Beatrix_vStorch 61329
## 6: 9.4 24.1 Beatrix_vStorch 61329
## 7: 9.6 31.8 helgelindh 9850
## 8: 4.5 34.3 DJanecek 11488
## 9: 6.4 24.4 Peter_Beyer 1799
## 10: 9.4 24.1 LisaPaus 9024
## created_at favorite_count retweet_count label emojis
## 1: 2020-02-20 08:31:00 915 181 negative
## 2: 2020-08-01 12:52:00 2 0 negative
## 3: 2020-07-23 05:12:00 9 0 negative
## 4: 2019-09-24 06:40:00 23 3 negative
## 5: 2018-11-07 10:58:00 377 141 negative <U+0001F92C>
## 6: 2019-02-21 13:44:00 1228 426 negative <U+0001F929>
## 7: 2020-02-06 11:36:00 67 15 negative
## 8: 2020-12-28 19:27:00 15 2 positive
## 9: 2020-10-27 08:10:00 1 0 positive
## 10: 2020-07-02 10:16:00 12 5 negative <U+0001F92F>
## hashtags
## 1: #AfD
## 2:
## 3:
## 4: #Mietendeckel,#Mietspiegel
## 5: #MidtermsElections2018
## 6: #SPD
## 7: #Kubicki,#Thueringen,#Fundamentalopposition
## 8:
## 9: #Mond,#USA,#Deutschland
## 10: #Dax,#Geldwaesche,#Wirecard,#Aufsicht,#Skandal
## tags
## 1:
## 2: @oida_grantler,@HollsteinM
## 3:
## 4: @cducsubt<U+2069>,@BFWBund<U+2069>
## 5:
## 6:
## 7:
## 8: @FAZ_Wirtschaft
## 9: @NASA,@DLR_de,@SOFIAtelescope
## 10: @MarcusTheurer
# Count emojis (stored as list objects, so we take the length of these lists)
twitter_emojis <- twitter_docvars[
, .(doc_id, emojis)
][, n_emojis := lengths(emojis)
][, emojis := NULL]
list(
twitter_emojis = twitter_emojis[sample(twitter_emojis[, .I], 10)],
distribution = table(twitter_emojis$n_emojis))
## $twitter_emojis
## doc_id n_emojis
## 1: BriHasselmann15481791601 0
## 2: DietmarBartsch15357022801 0
## 3: Beatrix_vStorch15862634401 0
## 4: DJanecek16016214001 0
## 5: karl_lauterbach15824776201 0
## 6: renatekuenast16019082001 0
## 7: EnricoKomning15133498801 0
## 8: anked15925735801 0
## 9: MaikBeermann15216265201 0
## 10: niemamovassat15904086001 0
##
## $distribution
##
## 0 1 2 3 4 5 6 7
## 1121 65 18 7 1 1 1 1
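Given how skewed this distribution is (the vast majority of tweets contain no emoji), a binary indicator might serve a classifier just as well; a sketch of ours, not applied below:
# Binary emoji indicator (illustration only; twitter_emojis stays unchanged)
twitter_emojis[, .(doc_id, has_emoji = as.integer(n_emojis > 0))]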
Negation handling under the bag-of-words assumption is difficult. There are more sophisticated approaches, but for this demo we will simply record whether an expression of negation is present.
First, define negation indicators (and don't forget to clean them):
tokens_negation <- clean_and_stem(c(
"nicht",
"nie",
"niemals",
"nein",
"niemand",
"nix",
"nirgends",
"kein"))
Then, we again use a dictionary to count, for each document, how often one of the above tokens occurs:
# Create dictionary
dictionary_negation <- quanteda::dictionary(list(negation = tokens_negation))
# Match tokens
twitter_negation <- quanteda::dfm_lookup(twitter_dfm, dictionary_negation)
# Convert resulting dfm object to data.table (via data.frame as quanteda does not yet support conversion to data.table)
twitter_negation <- quanteda::convert(twitter_negation, to = "data.frame")
twitter_negation <- data.table::as.data.table(
twitter_negation,
key = "doc_id")
# Inspect
twitter_negation[sample(twitter_negation[, .I], 10)]
## doc_id negation
## 1: ABaerbock15256297201 0
## 2: HajdukBundestag15518854801 0
## 3: KaiGehring15833458801 1
## 4: Beatrix_vStorch15379738201 1
## 5: GoeringEckardt15529879801 0
## 6: JM_Luczak15608635201 0
## 7: karl_lauterbach15721146001 0
## 8: KonstantinNotz15884889001 0
## 9: Dr_Rainer_Kraft15910718401 0
## 10: Frank_Magnitz15235207201 0
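Strictly speaking, dfm_lookup() returns counts, whereas we announced a mere presence/absence feature. If you prefer the binary version, a sketch (we keep the counts here to stay consistent with the output above):
# Binarized negation indicator (illustration only)
twitter_negation[, .(doc_id, negation_flag = as.integer(negation > 0))]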
Due to the sheer variety of words that exist in a language, a simple tokens representation will typically contain a huge number of features:
quanteda::nfeat(twitter_dfm)
## [1] 7583
This is an overwhelmingly large number of features for most machine learning classifiers (particularly so when, as here, we have more features than observations).
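One common remedy is to prune rare features, for instance with quanteda::dfm_trim; a quick sketch (the threshold of five documents is an arbitrary choice of ours):
# Drop tokens occurring in fewer than 5 documents (arbitrary threshold)
quanteda::nfeat(quanteda::dfm_trim(twitter_dfm, min_docfreq = 5))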
Here, we instead streamline our representation more radically. Most applications include some kind of n-grams; interestingly, character unigrams seem to work quite well, i.e., rather than entire words or even concatenations of words, we simply count occurrences of single letters.
This effectively reduces our texts to a set of merely 26 features:
# Create a new tokens object, this time consisting of single characters
twitter_tokens_char <- quanteda::tokens(
twitter_corpus,
what = "character",
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_separators = TRUE,
split_hyphens = TRUE)
# Convert to dfm
twitter_char_unigrams <- quanteda::dfm(twitter_tokens_char)
# Convert resulting dfm object to data.table (via data.frame as quanteda does not yet support conversion to data.table)
twitter_char_unigrams <- quanteda::convert(twitter_char_unigrams, to = "data.frame")
twitter_char_unigrams <- data.table::as.data.table(
twitter_char_unigrams,
key = "doc_id")
data.table::setcolorder(twitter_char_unigrams, c("doc_id", letters))
# Inspect
twitter_char_unigrams[sample(twitter_char_unigrams[, .I], 10)]
## doc_id a b c d e f g h i j k l m n o p q r s
## 1: Beatrix_vStorch15862634401 18 5 5 13 37 6 5 11 11 0 3 6 3 14 7 6 0 22 11
## 2: FrStraetmanns15923979601 8 6 8 12 30 4 6 9 18 0 3 5 3 14 5 1 0 12 10
## 3: MatthiasHauer15918112801 11 0 3 6 21 1 7 5 11 0 2 5 0 14 5 3 0 11 13
## 4: katdro15869262601 15 2 9 13 38 5 7 11 17 0 1 7 4 22 5 2 0 15 18
## 5: jensspahn15809874001 19 2 5 12 43 3 10 8 19 1 5 8 5 21 6 1 0 18 15
## 6: HuberMdB15830672401 8 3 9 11 35 3 8 11 11 0 4 4 1 16 7 1 0 17 12
## 7: DrAndreasNick16057137001 16 2 4 8 23 5 1 9 13 2 4 10 6 19 7 3 0 13 14
## 8: falkomohrs15414163201 14 1 6 7 36 3 4 11 16 0 4 15 5 11 7 4 0 18 25
## 9: HeikeHaensel15236835601 9 3 4 10 40 11 7 7 18 0 3 3 3 21 8 1 0 25 15
## 10: MetinHakverdi15736454401 11 2 7 5 32 4 10 12 17 0 3 8 5 19 7 7 0 13 11
## t u v w x y z
## 1: 11 10 2 2 0 0 2
## 2: 6 10 1 0 0 0 0
## 3: 5 9 3 2 0 0 3
## 4: 16 12 1 4 2 0 2
## 5: 14 8 3 2 0 0 5
## 6: 14 12 2 5 0 0 3
## 7: 8 10 2 1 0 0 1
## 8: 16 12 3 2 0 0 4
## 9: 17 17 4 5 0 1 3
## 10: 15 11 0 1 0 1 4
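Since raw character counts scale with tweet length, one might normalize them to relative frequencies; a sketch of ours, not applied in what follows:
# Normalize each letter count by the total number of a-z letters per tweet
letter_cols <- letters
char_totals <- twitter_char_unigrams[, rowSums(.SD), .SDcols = letter_cols]
twitter_char_unigrams_rel <- data.table::copy(twitter_char_unigrams)
twitter_char_unigrams_rel[
  , (letter_cols) := lapply(.SD, function(x) x / pmax(char_totals, 1)),
  .SDcols = letter_cols]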
In a final step, we put everything together (starting the join with the table of document IDs and labels so we lose nothing; twitter_pos_tags holds the part-of-speech counts extracted earlier):
# Collect all features
twitter_features_static <- twitter_docvars[, .(doc_id, label)
][twitter_pos_tags,
][twitter_negation,
][twitter_char_unigrams,
][twitter_emojis,
][twitter_polarities, ]
saveRDS(twitter_features_static, sprintf("%s/twitter_features_static.RDS", path))
twitter_features_static[sample(twitter_features_static[, .I], 10)]
## doc_id label ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM
## 1: b_riexinger15374616001 negative 4 3 4 0 2 6 0 8 0
## 2: Kaiser_SPD15533487601 positive 5 4 1 1 0 3 0 9 0
## 3: Schneider_AfD16028714401 positive 3 1 2 2 1 6 0 7 0
## 4: PaulZiemiak15744411601 positive 4 4 3 3 0 2 0 5 0
## 5: karl_lauterbach15812601001 negative 0 3 4 1 3 1 0 9 0
## 6: hubertus_heil15744274201 positive 5 2 1 1 2 2 0 8 0
## 7: HuberMdB15468630001 positive 3 4 0 0 0 5 0 7 1
## 8: KaiGehring15742380601 positive 4 4 1 4 0 0 0 10 1
## 9: juergenhardt15442605601 positive 5 3 1 0 2 0 0 5 0
## 10: SoenkeRix15547031401 positive 2 0 2 1 0 0 0 2 0
## PART PRON PROPN PUNCT SCONJ VERB X negation a b c d e f g h i j k l
## 1: 0 0 4 3 0 6 0 0 10 3 3 12 38 7 4 10 19 0 3 8
## 2: 0 0 4 4 0 2 0 0 15 1 2 8 29 7 9 3 13 0 5 6
## 3: 0 1 4 3 0 2 0 1 10 7 4 7 24 0 2 6 14 0 5 8
## 4: 0 10 1 6 0 6 0 0 15 5 10 6 35 3 6 13 19 0 2 8
## 5: 1 6 1 6 1 9 0 1 16 2 3 6 39 3 9 7 13 1 1 13
## 6: 0 2 0 4 0 4 0 0 11 6 4 7 27 2 3 8 11 1 4 3
## 7: 0 0 0 3 0 1 0 0 5 0 2 6 17 3 7 4 12 0 2 1
## 8: 1 2 1 8 0 4 0 0 11 2 8 5 31 5 2 12 18 1 5 4
## 9: 0 0 3 3 0 1 0 0 11 1 6 7 18 3 2 7 7 1 2 2
## 10: 1 2 0 2 1 1 0 1 3 0 2 3 10 3 2 3 8 1 0 2
## m n o p q r s t u v w x y z n_emojis positive negative
## 1: 9 20 8 5 0 12 7 9 6 0 4 0 0 3 0 4 4
## 2: 8 19 9 3 0 14 12 16 10 3 2 0 0 1 1 4 2
## 3: 4 13 9 1 0 11 7 10 5 1 3 0 0 3 0 3 1
## 4: 13 14 2 1 0 8 10 12 8 0 6 0 0 3 0 7 0
## 5: 8 13 4 3 0 20 17 21 6 0 5 0 0 4 0 7 3
## 6: 3 16 5 3 0 6 9 17 5 1 1 0 0 6 0 9 2
## 7: 7 14 6 2 0 5 2 8 4 0 1 0 0 0 0 2 0
## 8: 6 17 0 1 0 16 12 13 9 0 5 0 0 3 0 7 4
## 9: 3 14 3 3 0 12 12 13 10 2 0 0 0 2 0 1 0
## 10: 1 8 0 2 0 5 4 4 3 1 1 0 0 0 0 3 1
This provides us with a solid foundation for sentiment analysis! We now represent each document by a total of 47 variables. Of course, we can think of many more features to include here; we will see some more in the exercise. In general, be creative and see how classification can be improved by adding different explanatory variables.
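To preview how these features might be used, here is a minimal sketch of ours fitting a logistic regression on the static features, assuming label takes the values "positive" and "negative" (illustration only; a proper analysis would require a train/test split):
# Illustration only: logistic regression on all static features
dat <- data.table::copy(twitter_features_static)
dat[, y := as.integer(label == "positive")]
fit <- glm(y ~ . - doc_id - label, data = dat, family = binomial())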