This demo is about regular expressions (regex).
We show how to perform several basic text mining operations using the systematic regex syntax.
# Load required packages
library(stringr)
With regular expressions there are often several ways to specify the desired pattern. In the following, we will have a look at the most frequently used text mining operations, employing various regex patterns.
You are free to use whatever works but it’s certainly good practice to pick a style and stick to it as it makes your code more easily readable.
In any case do not hesitate to ask Google. Thanks to regex being present in many programming languages there is plenty of help out there.
The stringr vignette provides a nice intro, and the R regex cheatsheet has all the important stuff nicely arranged at a glance.
We will work with stringr here as we consider it extensive and user-friendly. Most stringr functions take a character vector and a pattern.
The most generic command to match regular patterns is probably str_detect, which returns a logical value for each string, indicating whether or not it contains the specified pattern.
Identify a specific character sequence:
stringr::str_detect(
string = c("Hello world", "bye"),
pattern = "world")
## [1] TRUE FALSE
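Because str_detect returns one logical value per input string, its result can be used directly for filtering. A minimal sketch, reusing the strings from the example above (the variable names texts and hits are only illustrative):

```r
library(stringr)

texts <- c("Hello world", "bye")

# str_detect() returns one logical value per element of the input vector
hits <- str_detect(texts, pattern = "world")

# Use the logical vector to keep only the matching strings
texts[hits]

# str_subset() combines both steps in a single call
str_subset(texts, pattern = "world")
```

Both calls should return only the first string, since "bye" does not contain the pattern.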
Identify any alphabetic character:
string <- c("999_user_999", "123")
list(
posix = stringr::str_detect(string, pattern = "[:alpha:]"),
ascii = stringr::str_detect(string, pattern = "[A-Za-z]"))
## $posix
## [1] TRUE FALSE
##
## $ascii
## [1] TRUE FALSE
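The two variants agree here, but they are not fully interchangeable: [:alpha:] is Unicode-aware, whereas [A-Za-z] only covers the 26 unaccented Latin letters. A small sketch with made-up inputs (umlauts written as Unicode escapes to avoid encoding issues):

```r
library(stringr)

# Illustrative inputs: umlauts only, plain ASCII letters, digits only
string <- c("\u00e4\u00f6\u00fc", "abc", "123")

posix <- str_detect(string, pattern = "[:alpha:]")  # Unicode-aware letter class
ascii <- str_detect(string, pattern = "[A-Za-z]")   # ASCII letters only

list(posix = posix, ascii = ascii)
```

The first string should only be detected by the [:alpha:] variant, which matters for German or other non-English text.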
Identify lowercase characters:
string <- c("why ON EARTH", "WHAT!!?!1!")
list(
posix = stringr::str_detect(string, pattern = "[:lower:]"), # [:upper:] for uppercase
ascii = stringr::str_detect(string, pattern = "[a-z]")) # [A-Z] for uppercase
## $posix
## [1] TRUE FALSE
##
## $ascii
## [1] TRUE FALSE
Identify digits:
string <- c("I got 99 problems but regex ain't one", "one two three")
list(
perl = stringr::str_detect(string, pattern = "\\d"),
posix = stringr::str_detect(string, pattern = "[:digit:]"),
ascii = stringr::str_detect(string, pattern = "[0-9]"))
## $perl
## [1] TRUE FALSE
##
## $posix
## [1] TRUE FALSE
##
## $ascii
## [1] TRUE FALSE
Identify punctuation:
stringr::str_detect(
string = c("Well... you're right!", "No punct to be seen here"),
pattern = "[:punct:]")
## [1] TRUE FALSE
Identify whitespace (spaces, tabs, line breaks):
string <- c("Gimme some space, will ya", "tightstuffhere")
list(
perl = stringr::str_detect(string, pattern = "\\s"),
posix = stringr::str_detect(string, pattern = "[:space:]"))
## $perl
## [1] TRUE FALSE
##
## $posix
## [1] TRUE FALSE
Caution with special characters: we can still query them, but we need to “escape” them by prefixing them with a backslash. This signals that we mean the literal symbol.
Even more caution with characters that have a special meaning in regex. For example, + is a quantifier indicating at least one repetition (more on this in a moment). To match a literal plus sign, the regex engine needs to see \+. However, the backslash is itself an escape character in R strings, so we have to write it twice: the R string "\\+" contains the regex \+.
Identify special characters:
stringr::str_detect(
string = "Cote d'Azure",
pattern = "\'") # the backslash is only required if the pattern is delimited by single quotes
## [1] TRUE
stringr::str_detect(
string = "Wanna be my +1?",
pattern = "\\+") # omitting the double backslashes will throw an error
## [1] TRUE
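To see what the regex engine actually receives, it can help to print the pattern with writeLines. The sketch below also shows two alternatives to double backslashes that we find handy; which one to prefer is a matter of taste:

```r
library(stringr)

# writeLines() prints the string content, i.e. what the regex engine receives
writeLines("\\+")

string <- "Wanna be my +1?"

escaped   <- str_detect(string, pattern = "\\+")        # escaped metacharacter
in_class  <- str_detect(string, pattern = "[+]")        # + is literal inside a character class
as_fixed  <- str_detect(string, pattern = fixed("+"))   # fixed() switches off regex interpretation

c(escaped = escaped, in_class = in_class, as_fixed = as_fixed)
```

All three variants should detect the plus sign. fixed() is the simplest choice when the whole pattern is meant literally.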
Now let’s talk repetitions.
Basically, we have the following possibilities:
?: repeat 0x or 1x
*: repeat 0x or more
+: repeat 1x or more
{n}: repeat exactly n times
{n,m}: repeat between n and m times
{n,}: repeat at least n times
{,m}: repeat at most m times
Matches 3 p’s in a row:
stringr::str_detect(
string = c("nlp_2021", "nlppp_2021"),
pattern = "p{3}")
## [1] FALSE TRUE
Matches 2 dots in a row:
stringr::str_detect(
string = c("never.stop.learning", "never stop learning.."),
pattern = "\\.{2}")
## [1] FALSE TRUE
Matches at most one l:
stringr::str_detect(
string = c("np", "nlp", "nllp", "nllllp"),
pattern = "nl?p")
## [1] TRUE TRUE FALSE FALSE
Matches at least two l’s:
stringr::str_detect(
string = c("np", "nlp", "nllp", "nllllp"),
pattern = "nl{2,}p")
## [1] FALSE FALSE TRUE TRUE
Matches between 1 and 2 l’s:
stringr::str_detect(
string = c("np", "nlp", "nllp", "nllllp"),
pattern = "nl{1,2}p")
## [1] FALSE TRUE TRUE FALSE
Matches one or more question marks:
stringr::str_detect(
string = c("What?", "What??", "Huh"),
pattern = "\\?+")
## [1] TRUE TRUE FALSE
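Note that the l examples above only behave like upper bounds because the pattern is framed by n and p. Since str_detect looks for a match anywhere in the string, a bare quantifier also succeeds inside a longer run. A short sketch of the difference, using the same strings:

```r
library(stringr)

string <- c("np", "nlp", "nllp", "nllllp")

# Without context, "l{1,2}" is satisfied by any one or two l's inside a longer run
bare <- str_detect(string, pattern = "l{1,2}")

# Framed by n and p (and anchored), the upper bound is enforced
framed <- str_detect(string, pattern = "^nl{1,2}p$")

# Quantifiers are greedy: they take as many repetitions as allowed
greedy_plus  <- str_extract("nllllp", pattern = "l+")
greedy_range <- str_extract("nllllp", pattern = "l{1,2}")

list(bare = bare, framed = framed, plus = greedy_plus, range = greedy_range)
```

The bare pattern should flag "nllllp" as a match, whereas the framed one should not.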
str_extract works similarly but returns the matched substring instead of TRUE or FALSE.
We will extract character sequences and have a look at groups wrapped in parentheses.
For example, we can extract sequences with some optional parts, which might be helpful in dealing with plurals:
stringr::str_extract(
string = c("Mark my word", "words don't do her justice"),
pattern = "word(s)?")
## [1] "word" "words"
We can also use grouping to specify various accepted options separated by |:
stringr::str_extract(
string = c("Mum, is there any pudding left?", "No mom, I want fries!", "Mommy please"),
pattern = "(M|m)(u|o)m(my)?")
## [1] "Mum" "mom" "Mommy"
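If the individual groups are of interest, str_match returns them alongside the full match. The sketch below also rewrites the single-character alternatives as character classes, which is an equivalent and slightly more compact notation (our choice, not a requirement):

```r
library(stringr)

string <- c("Mum, is there any pudding left?", "No mom, I want fries!", "Mommy please")

# [Mm] is equivalent to (M|m) but written as a capturing group here on purpose
groups <- str_match(string, pattern = "([Mm])([uo])m(my)?")

# Column 1 holds the full match, the remaining columns hold one group each;
# an optional group that did not take part in the match is returned as NA
groups
```

The first column should equal the str_extract result from above, and the last column should be NA wherever the optional "my" is absent.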
Besides detecting and extracting patterns, we sometimes need to know their exact location within a text. str_locate returns the beginning and end of the specified pattern (NA if the pattern is not detected).
Make sure to allow for repetitions if needed:
stringr::str_locate(
string = c("nlp2021", "nlp_kurs"),
pattern = "\\d")
## start end
## [1,] 4 4
## [2,] NA NA
stringr::str_locate(
string = c("nlp2021", "nlp_kurs"),
pattern = "\\d+")
## start end
## [1,] 4 7
## [2,] NA NA
Sometimes we might expect multiple occurrences of our patterns. In that case we use str_locate_all (str_locate only returns the location of the first match; the same is true for, e.g., str_replace and str_replace_all). When in doubt, better go for the _all variant.
string <- c("I am Thorin son of Thrain son of Thror, King under the Mountain!")
list(
single = stringr::str_locate(string, pattern = "of"),
multiple = stringr::str_locate_all(string, pattern = "of"))
## $single
## start end
## [1,] 17 18
##
## $multiple
## $multiple[[1]]
## start end
## [1,] 17 18
## [2,] 31 32
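The returned positions can be passed on to other functions, for instance to str_sub, which cuts out a substring by start and end index. A minimal sketch, assuming we want to recover the matched text from its location:

```r
library(stringr)

# Single match: locate the digits, then cut them out by position
loc <- str_locate("nlp2021", pattern = "\\d+")
digits <- str_sub("nlp2021", start = loc[, "start"], end = loc[, "end"])

# Multiple matches: str_locate_all() returns a list with one matrix per input string
string <- "I am Thorin son of Thrain son of Thror, King under the Mountain!"
all_loc <- str_locate_all(string, pattern = "of")[[1]]
found <- str_sub(string, start = all_loc[, "start"], end = all_loc[, "end"])

list(digits = digits, n_matches = nrow(all_loc), found = found)
```

In practice str_extract and str_extract_all are more direct for this purpose; the positions become useful when you need the surrounding context as well.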
Text cleaning will often require the replacement of unwanted leading or trailing sequences.
^ and $ are anchors indicating the beginning and end of a string, respectively.
Replace leading 0’s by an empty string (note that we could also use str_remove_all here):
stringr::str_replace(
string = c("0005", "1050"),
pattern = "^0+",
replacement = "")
## [1] "5" "1050"
Replace gender suffixes by “x”:
stringr::str_replace(
string = c("Innenarchitektin", "Binnenschifffahrtskapitänin", "ProfessorInnen"),
pattern = "(I|i)n(nen)?$",
replacement = "x")
## [1] "Innenarchitektx" "Binnenschifffahrtskapitänx"
## [3] "Professorx"
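Combining both anchors forces the pattern to cover the entire string, which is a common way to validate input. A small sketch with made-up inputs:

```r
library(stringr)

string <- c("2021", "nlp2021", "2021 ")

# Without anchors, any string containing a digit matches
anywhere <- str_detect(string, pattern = "\\d+")

# With ^ and $, the string must consist of digits only
whole <- str_detect(string, pattern = "^\\d+$")

# str_remove() is shorthand for replacing the match with an empty string
no_zeros <- str_remove(c("0005", "1050"), pattern = "^0+")

list(anywhere = anywhere, whole = whole, no_zeros = no_zeros)
```

Only the first string should pass the anchored check; the trailing space in the third one is enough to make it fail.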
Besides the simple ^ and $ there is another useful kind of anchoring pattern, namely look-arounds. These allow us to match a pattern only if it is preceded or followed by a specific other pattern:
(?=...): match if followed by “…”
(?<=...): match if preceded by “…”
Negate either one by replacing = by !, i.e., (?!...) and (?<!...).
Replace any word preceded by # with “hashtag”:
stringr::str_replace_all(
string = c("Ob Laschet Kanzler wird? #idautit"),
pattern = "(?<=#)[:alpha:]+",
replacement = "hashtag")
## [1] "Ob Laschet Kanzler wird? #hashtag"
Remove all numbers except when they are followed by a currency expression:
stringr::str_replace_all(
string = c("I bought 5 apples for 5$"),
pattern = "\\d+(?!\\$)",
replacement = "")
## [1] "I bought  apples for 5$"
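One caveat with this pattern: it only works as intended for single-digit prices. With several digits, the regex engine backtracks and matches a shorter run of digits that is not directly followed by $. The sketch below uses a made-up sentence to show the effect, plus one possible fix that additionally forbids a following digit:

```r
library(stringr)

string <- "I bought 15 apples for 15$"

# Naive version: in "15$" the engine gives up the "5" and still removes the "1"
naive <- str_replace_all(string, pattern = "\\d+(?!\\$)", replacement = "")

# Stricter version: the digits must be followed by neither another digit nor $
strict <- str_replace_all(string, pattern = "\\d+(?!\\d|\\$)", replacement = "")

c(naive = naive, strict = strict)
```

The naive version should leave "5$" behind, whereas the stricter one should keep "15$" intact. In both cases a double space remains where the first number was removed, which str_squish can clean up.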
Or, simply replace occurrences at any location (make sure to use the _all version):
stringr::str_replace_all(
string = c("Our Father, who art in heaven, hallowed be thy name; thy kingdom come, thy will be done on earth as it is in heaven."),
pattern = "thy",
replacement = "your")
## [1] "Our Father, who art in heaven, hallowed be your name; your kingdom come, your will be done on earth as it is in heaven."
stringr offers a lot more than we have covered so far.
Some additional operations that might be useful:
stringr::str_split(
string = c("Heufer-Umlauf", "Walter-Borjans"),
pattern = "-")
## [[1]]
## [1] "Heufer" "Umlauf"
##
## [[2]]
## [1] "Walter" "Borjans"
stringr::str_count(
string = c("lorem ipsum, quia dolor sit"),
pattern = "i")
## [1] 3
stringr::str_squish(
string = c(" if you manipulate strings this sometimes leaves annoying whitespaces "))
## [1] "if you manipulate strings this sometimes leaves annoying whitespaces"
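str_squish is related to str_trim, and the difference is easy to overlook. A short sketch with a made-up string that has surplus whitespace both at the edges and between words:

```r
library(stringr)

string <- "  too   many   spaces  "

# str_trim() only removes whitespace at the start and end of the string
trimmed <- str_trim(string)

# str_squish() additionally collapses repeated inner whitespace to a single space
squished <- str_squish(string)

c(trimmed = trimmed, squished = squished)
```

After trimming, the inner gaps should still be three spaces wide; after squishing, every gap should be a single space.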
As you might have realized by now, handling regex is often trial-and-error (especially when you try to match more complex sequences). They will most certainly drive you crazy at some point or other, but they are a powerful and absolutely essential tool in text mining.