Handling Regular Expressions

This demo is about regular expressions (regex).

We show how to perform several basic text mining operations using the systematic regex syntax.

# Load required packages

library(stringr)

Disclaimer

With regular expressions there are often several ways to specify the desired pattern. In the following, we will have a look at the most frequently used text mining operations, employing various regex patterns.

You are free to use whatever works but it’s certainly good practice to pick a style and stick to it as it makes your code more easily readable.

In any case do not hesitate to ask Google. Thanks to regex being present in many programming languages there is plenty of help out there.

The stringr vignette provides a nice intro, and the R regex cheatsheet has all the important stuff nicely arranged at a glance.

Identify

We will work with stringr here as we consider it extensive and user-friendly. Most stringr functions take a character vector and a pattern.

The most generic command to match regular patterns is probably str_detect, which returns a logical vector indicating, for each string, whether or not it contains the specified pattern.

Identify a specific character sequence:

stringr::str_detect(
  string = c("Hello world", "bye"), 
  pattern = "world")
## [1]  TRUE FALSE
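
Because the result is a logical vector, it can be used directly to filter the input. A small sketch of two equivalent ways to do this, either by indexing with the result of str_detect or with stringr's str_subset, which combines both steps:

```r
string <- c("Hello world", "bye")

# Index the vector with the logical result of str_detect ...
by_index <- string[stringr::str_detect(string, pattern = "world")]
by_index
## [1] "Hello world"

# ... or let str_subset do the detection and the filtering in one call
by_subset <- stringr::str_subset(string, pattern = "world")
by_subset
## [1] "Hello world"
```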

Identify any character value:

string <- c("999_user_999", "123")

list(
  posix = stringr::str_detect(string, pattern = "[:alpha:]"),
  ascii = stringr::str_detect(string, pattern = "[A-Za-z]"))
## $posix
## [1]  TRUE FALSE
## 
## $ascii
## [1]  TRUE FALSE
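
The two notations agree here only because the example is plain ASCII. As a small sketch (the strings are our own illustration, not part of the example above), a letter outside A-Z such as "ä" is matched by the Unicode-aware [:alpha:] class but not by the ASCII range:

```r
# "\u00e4" is the letter a-umlaut, written as an escape to avoid encoding issues
string <- c("\u00e4", "a")

res <- list(
  posix = stringr::str_detect(string, pattern = "[:alpha:]"),
  ascii = stringr::str_detect(string, pattern = "[A-Za-z]"))
res
## $posix
## [1] TRUE TRUE
## 
## $ascii
## [1] FALSE  TRUE
```

For text in languages with umlauts or accents, the class notation is therefore usually the safer choice.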

Identify lowercase characters:

string <- c("why ON EARTH", "WHAT!!?!1!")

list(
  posix = stringr::str_detect(string, pattern = "[:lower:]"), # [:upper:] for uppercase
  ascii = stringr::str_detect(string, pattern = "[a-z]")) # [A-Z] for uppercase
## $posix
## [1]  TRUE FALSE
## 
## $ascii
## [1]  TRUE FALSE

Identify digits:

string <- c("I got 99 problems but regex ain't one", "one two three")

list(
  perl = stringr::str_detect(string, pattern = "\\d"),
  posix = stringr::str_detect(string, pattern = "[:digit:]"),
  ascii = stringr::str_detect(string, pattern = "[0-9]"))
## $perl
## [1]  TRUE FALSE
## 
## $posix
## [1]  TRUE FALSE
## 
## $ascii
## [1]  TRUE FALSE

Identify punctuation:

stringr::str_detect(
  string = c("Well... you're right!", "No punct to be seen here"), 
  pattern = "[:punct:]")
## [1]  TRUE FALSE

Identify spaces:

string <- c("Gimme some space, will ya", "tightstuffhere")

list(
  perl = stringr::str_detect(string, pattern = "\\s"),
  posix = stringr::str_detect(string, pattern = "[:space:]"))
## $perl
## [1]  TRUE FALSE
## 
## $posix
## [1]  TRUE FALSE

Take care with special characters: some of them can only be written inside an R string if we “escape” them by prefixing them with a backslash. This tells R that we mean a literal symbol here. A single quote inside a double-quoted string does not strictly need the backslash, but it is required as soon as the string itself is delimited by single quotes.

Take even more care with characters that have a special meaning in regex. For example, + indicates one or more repetitions of the preceding element (more on this in a moment). To match a literal plus sign, the regex engine needs to see \+. Since the backslash is itself an escape character in R strings, we have to write it twice, so the pattern becomes "\\+".

Identify special characters:

stringr::str_detect(
  string = "Côte d'Azur", 
  pattern = "\'") # the backslash is optional here; it is required in a single-quoted string ('\'')
## [1] TRUE
stringr::str_detect(
  string = "Wanna be my +1?", 
  pattern = "\\+") # omitting the double backslashes will throw an error
## [1] TRUE
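
To see the two layers of escaping at work, it can help to print a pattern the way the regex engine receives it. The following sketch uses base R's writeLines for that, and stringr's fixed() as an alternative that switches off regex interpretation altogether:

```r
# R turns the string literal "\\+" into the two characters \+ ...
writeLines("\\+")
## \+

# ... which the regex engine then reads as a literal plus sign
escaped <- stringr::str_detect("Wanna be my +1?", pattern = "\\+")
escaped
## [1] TRUE

# fixed() compares the pattern literally, so no escaping is needed
literal <- stringr::str_detect("Wanna be my +1?", pattern = stringr::fixed("+"))
literal
## [1] TRUE
```

fixed() is convenient whenever the pattern is plain text that happens to contain regex symbols.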

Now let’s talk repetitions.

Basically, we have the following possibilities:

  • ? repeat 0x or 1x
  • * repeat 0x or more
  • + repeat 1x or more
  • {n} repeat exactly n times
  • {n,m} repeat between n and m times
  • {n,} repeat at least n times
  • {0,m} repeat at most m times (the shorthand {,m} is not accepted by every regex engine)

Matches 3 p’s in a row:

stringr::str_detect(
  string  = c("nlp_2021", "nlppp_2021"),
  pattern = "p{3}")
## [1] FALSE  TRUE

Matches 2 dots in a row:

stringr::str_detect(
  string  = c("never.stop.learning", "never stop learning.."),
  pattern = "\\.{2}")
## [1] FALSE  TRUE

Matches at most one l:

stringr::str_detect(
  string = c("np", "nlp", "nllp", "nllllp"),
  pattern = "nl?p")
## [1]  TRUE  TRUE FALSE FALSE

Matches at least two l’s:

stringr::str_detect(
  string = c("np", "nlp", "nllp", "nllllp"),
  pattern = "nl{2,}p")
## [1] FALSE FALSE  TRUE  TRUE

Matches between 1 and 2 l’s:

stringr::str_detect(
  string = c("np", "nlp", "nllp", "nllllp"),
  pattern = "nl{1,2}p")
## [1] FALSE  TRUE  TRUE FALSE

Matches one or more question marks:

stringr::str_detect(
  string = c("What?", "What??", "Huh"),
  pattern = "\\?+")
## [1]  TRUE  TRUE FALSE
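
The * quantifier and an explicit upper bound have not appeared in an example yet. A minimal sketch, reusing the strings from the l examples and adding "nxp" as a string of our own that should never match:

```r
string <- c("np", "nlp", "nllp", "nllllp", "nxp")

res <- list(
  # l* accepts any number of l's, including none
  any_number = stringr::str_detect(string, pattern = "nl*p"),
  # l{0,2} accepts zero, one or two l's
  at_most_two = stringr::str_detect(string, pattern = "nl{0,2}p"))
res
## $any_number
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
## 
## $at_most_two
## [1]  TRUE  TRUE  TRUE FALSE FALSE
```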

Extract

str_extract works similarly but returns the first matched substring instead of TRUE or FALSE (NA if the pattern is not found).

We will extract character sequences and have a look at groups wrapped in parentheses.

For example, we can extract sequences with some optional parts, which might be helpful in dealing with plurals:

stringr::str_extract(
  string = c("Mark my word", "words don't do her justice"),
  pattern = "word(s)?")
## [1] "word"  "words"

We can also use grouping to specify various accepted options separated by |:

stringr::str_extract(
  string = c("Mum, is there any pudding left?", "No mom, I want fries!", "Mommy please"),
  pattern = "(M|m)(u|o)m(my)?")
## [1] "Mum"   "mom"   "Mommy"
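
Parentheses do more than group alternatives: each pair also captures the part of the match it covers. str_extract only returns the full match. As a sketch of how to get at the individual groups, stringr's str_match returns a character matrix with the full match in the first column and one further column per group:

```r
m <- stringr::str_match(
  string = c("Mum, is there any pudding left?", "Mommy please"),
  pattern = "(M|m)(u|o)m(my)?")

# Column 1 holds the full match, columns 2 to 4 the three groups.
# Row 1: "Mum", "M", "u", NA (the optional group "(my)?" did not take part)
# Row 2: "Mommy", "M", "o", "my"
m
```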

Locate

Besides detecting and extracting patterns, we sometimes need to know their exact location within a text. str_locate returns the beginning and end of the specified pattern (NA if the pattern is not detected).

Make sure to allow for repetitions if needed:

stringr::str_locate(
  string = c("nlp2021", "nlp_kurs"),
  pattern = "\\d")
##      start end
## [1,]     4   4
## [2,]    NA  NA
stringr::str_locate(
  string = c("nlp2021", "nlp_kurs"),
  pattern = "\\d+")
##      start end
## [1,]     4   7
## [2,]    NA  NA

Sometimes we might expect multiple occurrences of our patterns. In that case we use str_locate_all (str_locate only returns the location of the first match – the same is true for, e.g., str_replace and str_replace_all). When in doubt, better go for _all.

string <- c("I am Thorin son of Thrain son of Thror, King under the Mountain!")

list(
  single = stringr::str_locate(string, pattern = "of"),
  multiple = stringr::str_locate_all(string, pattern = "of"))
## $single
##      start end
## [1,]    17  18
## 
## $multiple
## $multiple[[1]]
##      start end
## [1,]    17  18
## [2,]    31  32
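
The positions returned by str_locate can be fed back into other functions. As a small sketch, str_sub cuts out the substring between a start and an end position. The pattern "Thr\\w+" ("Thr" followed by one or more word characters) is our own example:

```r
string <- "I am Thorin son of Thrain son of Thror, King under the Mountain!"

# The first match of the pattern is "Thrain", at positions 20 to 25
loc <- stringr::str_locate(string, pattern = "Thr\\w+")

name <- stringr::str_sub(string, start = loc[, "start"], end = loc[, "end"])
name
## [1] "Thrain"
```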

Replace

Text cleaning will often require replacement of unwanted, leading or trailing sequences.

^ and $ are anchors indicating the beginning and end of a string, respectively.

Replace leading 0’s by an empty string (note that we could also use str_remove here):

stringr::str_replace(
  string = c("0005", "1050"),
  pattern = "^0+",
  replacement = "")
## [1] "5"    "1050"

Replace gender suffixes by “x”:

stringr::str_replace(
  string = c("Innenarchitektin", "Binnenschifffahrtskapitänin", "ProfessorInnen"),
  pattern = "(I|i)n(nen)?$",
  replacement = "x")
## [1] "Innenarchitektx"            "Binnenschifffahrtskapitänx"
## [3] "Professorx"

Besides the simple ^ and $ there is another useful kind of positional pattern, namely look-arounds. They allow us to match a pattern only if it is preceded or followed by a specific other pattern:

  • Look-ahead (?=...): match if followed by “…”
  • Look-behind (?<=...): match if preceded by “…”
  • Negated version for both: replace = by !
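
As a quick sketch of two of these variants, a plain look-ahead and a negated look-behind (both example strings are our own):

```r
# Look-ahead: extract a number only if it is followed by " kg"
weight <- stringr::str_extract(
  string = "5 kg of flour and 3 eggs",
  pattern = "\\d+(?= kg)")
weight
## [1] "5"

# Negated look-behind: replace "happy" unless it is preceded by "un"
mood <- stringr::str_replace_all(
  string = "unhappy and happy",
  pattern = "(?<!un)happy",
  replacement = "glad")
mood
## [1] "unhappy and glad"
```

Note that the look-around part itself is never included in the match: " kg" is checked but not extracted.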

Replace any word preceded by # with “hashtag”:

stringr::str_replace_all(
  string = c("Ob Laschet Kanzler wird? #idautit"),
  pattern = "(?<=#)[:alpha:]+",
  replacement = "hashtag")
## [1] "Ob Laschet Kanzler wird? #hashtag"

Remove all numbers except when they are followed by a currency expression:

stringr::str_replace_all(
  string = c("I bought 5 apples for 5$"),
  pattern = "\\d+(?!\\$)",
  replacement = "")
## [1] "I bought  apples for 5$"
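
One caveat with this pattern: when the look-ahead fails, \\d+ may give back digits and try again, so a multi-digit price such as 15$ would lose its leading digit. A possible fix, sketched here on a sentence of our own, is to also forbid a following digit inside the look-ahead:

```r
string <- "I bought 12 apples for 15$"

res <- list(
  # "15" fails the look-ahead, but "1" (followed by "5") passes and is removed
  original = stringr::str_replace_all(string, pattern = "\\d+(?!\\$)", replacement = ""),
  # Forbidding a following digit as well keeps the number intact
  stricter = stringr::str_replace_all(string, pattern = "\\d+(?![\\d\\$])", replacement = ""))
res
## $original
## [1] "I bought  apples for 5$"
## 
## $stricter
## [1] "I bought  apples for 15$"
```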

Or, simply replace occurrences at any location (make sure to use the _all version):

stringr::str_replace_all(
  string = c("Our Father, who art in heaven, hallowed be thy name; thy kingdom come, thy will be done on earth as it is in heaven."),
  pattern = "thy",
  replacement = "your")
## [1] "Our Father, who art in heaven, hallowed be your name; your kingdom come, your will be done on earth as it is in heaven."

Other useful operations

stringr offers a lot more than we have covered so far.

Some additional operations that might be useful:

stringr::str_split(
  string = c("Heufer-Umlauf", "Walter-Borjans"),
  pattern = "-")
## [[1]]
## [1] "Heufer" "Umlauf"
## 
## [[2]]
## [1] "Walter"  "Borjans"
stringr::str_count(
  string = c("lorem ipsum, quia dolor sit"),
  pattern = "i")
## [1] 3
stringr::str_squish(
  string = c("   if  you manipulate   strings this sometimes leaves  annoying whitespaces   "))
## [1] "if you manipulate strings this sometimes leaves annoying whitespaces"

As you might have realized by now, working with regex often involves trial and error (especially when you try to match more complex sequences). Regular expressions will most certainly drive you crazy at some point or other, but they are a powerful and absolutely essential tool in text mining.