The R community has developed many packages to run off-the-shelf unsupervised sentiment analysis methods, also called dictionary methods. Once a method is published for another language (e.g. Python), it is usually only a matter of time before an R developer releases an R version of the package. In addition, there are useful packages to access dictionaries and other resources directly from R. In this tutorial, you will see some examples of how to use these resources. You can find the sources to run this yourself in the associated Github repository.

VADER

As we saw in the unsupervised sentiment analysis topic, VADER is one of the best unsupervised methods to analyze social media text, in particular Twitter. A recently developed R package makes it very easy to run VADER from R. You install and load the package as you would any other R package:

install.packages("vader")
library(vader)

The vader package includes a function called get_vader() that will run the VADER method over a text you give it:

get_vader(":) ")
## word_scores    compound         pos         neu         neg   but_count 
##       "{2}"     "0.459"         "1"         "0"         "0"         "0"

The result is an R vector with named entries:
- word_scores: a string containing an ordered list with the matched score for each word in the text. In the example above you can see the positive score of the smiley; in later examples you will see the negative score for “horrible” and the positive score for “love”.
- compound: the final valence compound of VADER for the whole text after applying modifiers and aggregation rules
- pos, neg, and neu: the parts of the compound for positive, negative, and neutral content. These take into account modifiers and are combined when calculating the compound score
- but_count: an additional count of “but”, since it can complicate the calculation of sentiment

A named vector can be accessed with the bracket operator:

vaderres <- get_vader("This book is horrible, but I love it.")
vaderres["compound"]
## compound 
##  "0.676"

VADER also takes punctuation into account; for example, an exclamation mark makes a positive word more intense here:

get_vader("This book is horrible, but I love it!")
##                      word_scores                         compound 
## "{0, 0, 0, -1.25, 0, 0, 4.8, 0}"                          "0.704" 
##                              pos                              neu 
##                          "0.425"                          "0.418" 
##                              neg                        but_count 
##                          "0.157"                              "1"

Words can also modify the meaning of other words, for example when amplifiers make sentiment stronger:

get_vader("This book is bad")
##       word_scores          compound               pos               neu 
## "{0, 0, 0, -2.5}"          "-0.542"               "0"           "0.462" 
##               neg         but_count 
##           "0.538"               "0"
get_vader("This book is very bad")
##            word_scores               compound                    pos 
## "{0, 0, 0, 0, -2.793}"               "-0.585"                    "0" 
##                    neu                    neg              but_count 
##                "0.513"                "0.487"                    "0"

Modifiers can make sentiment weaker too, for example the word “slightly” in this case:

get_vader("This book is bad")
##       word_scores          compound               pos               neu 
## "{0, 0, 0, -2.5}"          "-0.542"               "0"           "0.462" 
##               neg         but_count 
##           "0.538"               "0"
get_vader("This book is slightly bad")
##            word_scores               compound                    pos 
## "{0, 0, 0, 0, -2.207}"               "-0.495"                    "0" 
##                    neu                    neg              but_count 
##                "0.555"                "0.445"                    "0"

And negators reverse the valence of a word and weaken it a bit, based on empirical observations reported in the original paper:

get_vader("This book is not bad")
##          word_scores             compound                  pos 
## "{0, 0, 0, 0, 1.85}"              "0.431"              "0.416" 
##                  neu                  neg            but_count 
##              "0.584"                  "0"                  "0"

The vader package also includes a vader_df() function to run VADER over a series of texts and produce results in a data frame. Here we use the schrute library to load the scripts of the US TV series “The Office” and run VADER over the first six lines of the series:

library(schrute)
texts <- head(theoffice$text)
vader_df(texts)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                      text
## 1                                                                                                                                                                                                                                                                                                                                                                          All right Jim. Your quarterlies look very good. How are things at the library?
## 2                                                                                                                                                                                                                                                                                                                                                                                                              Oh, I told you. I couldn't close it. So...
## 3                                                                                                                                                                                                                                                                                                                                                                     So you've come to the master for guidance? Is this what you're saying, grasshopper?
## 4                                                                                                                                                                                                                                                                                                                                                                                                              Actually, you called me in here, but yeah.
## 5                                                                                                                                                                                                                                                                                                                                                                                                         All right. Well, let me show you how it's done.
## 6 Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... So that's the way it's done.
##                                                                                                                                                                                                                                                                       word_scores
## 1                                                                                                                                                                                                                                  {0, 0, 0, 0, 0, 0, 0, 2.193, 0, 0, 0, 0, 0, 0}
## 2                                                                                                                                                                                                                                                     {0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3                                                                                                                                                                                                                                      {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 4                                                                                                                                                                                                                                                      {0, 0, 0, 0, 0, 0, 0, 1.8}
## 5                                                                                                                                                                                                                                                {0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0}
## 6 {1.7, 0, 1.5, 0, 0, 0, 0, 0, 0, 1.3, 1.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.3, 2.133, 0, -0.3, 0, -1.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.393, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
##   compound   pos   neu   neg but_count
## 1    0.493 0.197 0.803 0.000         0
## 2    0.000 0.000 1.000 0.000         0
## 3    0.000 0.000 1.000 0.000         0
## 4    0.421 0.286 0.714 0.000         1
## 5    0.273 0.189 0.811 0.000         0
## 6    0.857 0.168 0.754 0.078         0

Due to the modification rules, VADER is not as fast as other dictionary-based methods and running it for the whole theoffice dataset can take from minutes to hours depending on your computer. However, the vader_df function makes it much faster than a loop.
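
If you do want to score the whole dataset, a minimal sketch could look like the following (not run here because of the time it takes; the object name officevader is just for illustration):

# score every line of The Office with VADER (this can take a long time)
officevader <- vader_df(theoffice$text)
head(officevader)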

Syuzhet

Syuzhet is a dictionary developed by [Matthew Jockers](https://www.matthewjockers.net/2015/02/02/syuzhet/) to analyze the arcs of sentiment in novels. The R package syuzhet makes it very easy to use and includes other sentiment dictionaries too:

install.packages("syuzhet")
library(syuzhet)

The syuzhet package provides the get_sentiment() function, which works similarly to get_vader() but uses the syuzhet lexicon and sums the scores of the words of the text:

get_sentiment("This book is horrible")
## [1] -0.75
get_sentiment("This book is horrible, but I love it")
## [1] 0
get_sentiment("This book is horrible, but I love it!")
## [1] 0

As you see in the examples above, the first sentence is scored as negative, but the coexistence of a negative and a positive word in the second example makes them cancel out, giving a score of zero. There are no modification rules, so adding an exclamation mark makes no difference.
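
Since word scores are simply summed, negation is not handled either. A quick check you can run (a sketch, assuming the default syuzhet lexicon):

# no negation rules are applied, so "not" does not reverse the score as VADER does
get_sentiment("This book is bad")
get_sentiment("This book is not bad")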

The get_sentiment() function can be run for a vector with more than one text, making it simple to get many scores:

get_sentiment(head(theoffice$text))
## [1] 2.15 0.00 0.50 0.00 1.60 4.60

syuzhet also includes other dictionaries, like the popular AFINN dictionary. In AFINN, words are scored on the same scale as SentiStrength, as integers from -5 to +5, and thus the output looks a bit different:

get_sentiment(head(theoffice$text), method="afinn")
## [1] 3 0 0 1 0 3

When combined with dplyr data wrangling tools, packages like syuzhet make it very easy to compute sentiment aggregates. Here we run Syuzhet over all the lines in The Office, since it is much faster than VADER, and rank the main characters of the series by their mean sentiment:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
theoffice$sentiment <- get_sentiment(theoffice$text)
theoffice %>% 
  group_by(character) %>% 
  summarise(sent=mean(sentiment), n=n()) %>% 
  arrange(desc(n)) %>% head(n=20) %>% 
  arrange(desc(sent))
## # A tibble: 20 × 3
##    character  sent     n
##    <chr>     <dbl> <int>
##  1 Michael   0.336 10921
##  2 Andy      0.280  3754
##  3 Nellie    0.260   529
##  4 Jim       0.250  6303
##  5 Gabe      0.237   427
##  6 Jan       0.234   810
##  7 Pam       0.231  5031
##  8 Ryan      0.222  1198
##  9 Darryl    0.217  1182
## 10 Holly     0.208   555
## 11 Erin      0.199  1440
## 12 Toby      0.199   818
## 13 Angela    0.181  1569
## 14 Phyllis   0.178   970
## 15 Kevin     0.176  1564
## 16 Oscar     0.174  1368
## 17 Dwight    0.172  6847
## 18 Kelly     0.163   841
## 19 Stanley   0.120   678
## 20 Meredith  0.103   559

Tidytext

tidytext is a very useful package that follows the same philosophy as dplyr but for text. It is very powerful for text analysis, including sentiment analysis.

install.packages("tidytext")
library(tidytext)

When you run some tidytext functions, you might be asked to install additional packages like “textdata” or to download resources like the sentiment dictionaries. The plain installation keeps these additional resources to a minimum, and you will be asked to install each of them only the first time you use it.

Tidytext, like other dplyr-related packages, uses tibbles, which are an enhanced version of data frames. Converting plain text to a tibble is rather simple. In this example we create a tibble with the text of the lines of the theoffice dataset and add a column with the name of the character who says each line:

texts <- theoffice$text
head(texts)
## [1] "All right Jim. Your quarterlies look very good. How are things at the library?"                                                                                                                                                                                                                                                                                                                                                                         
## [2] "Oh, I told you. I couldn't close it. So..."                                                                                                                                                                                                                                                                                                                                                                                                             
## [3] "So you've come to the master for guidance? Is this what you're saying, grasshopper?"                                                                                                                                                                                                                                                                                                                                                                    
## [4] "Actually, you called me in here, but yeah."                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "All right. Well, let me show you how it's done."                                                                                                                                                                                                                                                                                                                                                                                                        
## [6] "Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... So that's the way it's done."
textdf <- tibble(character=theoffice$character, text=texts)
head(textdf)
## # A tibble: 6 × 2
##   character text                                                                
##   <chr>     <chr>                                                               
## 1 Michael   All right Jim. Your quarterlies look very good. How are things at t…
## 2 Jim       Oh, I told you. I couldn't close it. So...                          
## 3 Michael   So you've come to the master for guidance? Is this what you're sayi…
## 4 Jim       Actually, you called me in here, but yeah.                          
## 5 Michael   All right. Well, let me show you how it's done.                     
## 6 Michael   Yes, I'd like to speak to your office manager, please. Yes, hello. …

Once you have your text in a tibble, you can tokenize it (i.e. split it into words) with the unnest_tokens() function. This function creates a column named after its first parameter to hold the tokens of the text column specified in its second parameter:

textdf %>% 
  unnest_tokens(word, text) -> wordsdf
head(wordsdf)
## # A tibble: 6 × 2
##   character word       
##   <chr>     <chr>      
## 1 Michael   all        
## 2 Michael   right      
## 3 Michael   jim        
## 4 Michael   your       
## 5 Michael   quarterlies
## 6 Michael   look

You can combine it with the count() function of dplyr to make a table of word frequencies in the whole theoffice corpus:

wordsdf %>% 
  count(word) %>% 
  arrange(desc(n))
## # A tibble: 19,631 × 2
##    word      n
##    <chr> <int>
##  1 i     20796
##  2 you   19987
##  3 the   14525
##  4 to    13521
##  5 a     12965
##  6 and    8841
##  7 it     8382
##  8 that   7794
##  9 is     7426
## 10 of     6344
## # … with 19,621 more rows

Tidytext includes some useful datasets, for example the stop_words dataset, which contains English stop words that carry little meaning on their own. You can combine it with the anti_join() function to remove them from the tokens tibble:

data(stop_words)
wordsdf %>% 
  anti_join(stop_words)
## Joining with `by = join_by(word)`
## # A tibble: 170,168 × 2
##    character word       
##    <chr>     <chr>      
##  1 Michael   jim        
##  2 Michael   quarterlies
##  3 Michael   library    
##  4 Jim       told       
##  5 Jim       close      
##  6 Michael   master     
##  7 Michael   guidance   
##  8 Michael   grasshopper
##  9 Jim       called     
## 10 Jim       yeah       
## # … with 170,158 more rows

You can see the effect of removing stop words in the ranking we calculated earlier:

wordsdf %>% 
  anti_join(stop_words)  %>% 
  count(word) %>% 
  arrange(desc(n))
## Joining with `by = join_by(word)`
## # A tibble: 18,960 × 2
##    word        n
##    <chr>   <int>
##  1 yeah     2930
##  2 hey      2232
##  3 michael  1860
##  4 uh       1463
##  5 gonna    1405
##  6 dwight   1345
##  7 jim      1162
##  8 time     1149
##  9 pam      1043
## 10 guys      947
## # … with 18,950 more rows

You can do this for any word list: specify your own and count occurrences of its words. For example, you can calculate the normalized frequency of the words “uh”, “hey”, and “um” across characters who say more than 5000 words:

wordlist <- data.frame(word=c("uh","hey","um"))
wordsdf %>% 
  inner_join(wordlist)  %>% 
  group_by(character) %>%
  summarize(nuh=n()) -> uhcount
## Joining with `by = join_by(word)`
wordsdf %>% 
  group_by(character) %>%
  summarize(n=n()) -> charcount

inner_join(uhcount,charcount) %>% filter(n>5000) %>% mutate(ratioUh = nuh/n) %>% arrange(desc(ratioUh))
## Joining with `by = join_by(character)`
## # A tibble: 19 × 4
##    character   nuh      n ratioUh
##    <chr>     <int>  <int>   <dbl>
##  1 Toby        156   7809 0.0200 
##  2 Pam         539  44962 0.0120 
##  3 Jim         671  57529 0.0117 
##  4 Ryan        128  11785 0.0109 
##  5 Gabe         60   5612 0.0107 
##  6 Darryl       99  11099 0.00892
##  7 Jan          64   7858 0.00814
##  8 Erin         98  12816 0.00765
##  9 Michael    1032 147019 0.00702
## 10 Andy        294  43808 0.00671
## 11 Oscar        74  11813 0.00626
## 12 Kevin        68  12416 0.00548
## 13 Phyllis      36   7467 0.00482
## 14 Dwight      358  74411 0.00481
## 15 Angela       55  13409 0.00410
## 16 Kelly        34   9066 0.00375
## 17 Nellie       21   6836 0.00307
## 18 Stanley      16   5665 0.00282
## 19 Robert        7   5665 0.00124

You can learn more about tidytext at https://www.tidytextmining.com/

Using your own lexicon

You can run any dictionary method yourself through tidytext as long as you have a file with the words and their scores or sentiment classifications. tidytext gives access to three sentiment lexica with three different kinds of annotations. You can get the annotations of each word with the get_sentiments() function. When running the lines below, you will have to say “yes” to download the data files:

get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

As you see, AFINN has a numerical score per word, bing has words in classes of positivity and negativity, and the NRC lexicon maps words to emotions (these are emotions from Plutchik’s wheel, you can learn more about it in the measuring emotions topic).
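
The NRC annotations are not printed here to keep the output short, but you can inspect them in the same way (the first call will also ask you to download the lexicon):

# word-to-emotion annotations of the NRC lexicon (output omitted)
get_sentiments("nrc")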

You can easily match words in texts to these lexica with tidytext. The code below takes the tokens data frame, matches words with the corresponding scores in the AFINN lexicon, and calculates a ranking of main characters in The Office by the mean sentiment score of the words they say.

wordsdf %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(character) %>%
  summarize(sent=mean(value), n=n()) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  arrange(desc(sent)) -> afinndf
## Joining with `by = join_by(word)`
afinndf
## # A tibble: 20 × 3
##    character  sent     n
##    <chr>     <dbl> <int>
##  1 Pam       0.802  3946
##  2 Jim       0.773  4674
##  3 Ryan      0.745   905
##  4 Holly     0.703   438
##  5 Erin      0.697  1130
##  6 Phyllis   0.693   583
##  7 Robert    0.675   461
##  8 Andy      0.643  3609
##  9 Gabe      0.621   443
## 10 Michael   0.607 12601
## 11 Jan       0.590   614
## 12 Kevin     0.582  1120
## 13 Darryl    0.575   954
## 14 Toby      0.542   565
## 15 Oscar     0.533   885
## 16 Kelly     0.454   795
## 17 Dwight    0.366  6095
## 18 Nellie    0.333   574
## 19 Angela    0.304  1176
## 20 Stanley   0.260   415

As you see, even the most negative character (Stanley) still has a positive mean sentiment score. This is called the positivity bias of language, and you can see more examples of it if you run other sentiment analysis methods in the Twitter sentiment exercise.

You can do the same as above but starting from a file. Saif Mohammad developed the NRC lexica, and one of the latest is the NRC Valence, Arousal, and Dominance lexicon (NRC-VAD). The following code downloads the NRC-VAD files and unzips them in your local folder; uncomment the lines and run them once:

# uncomment and run once to download and unzip the NRC-VAD lexicon
#download.file("https://saifmohammad.com/WebDocs/VAD/NRC-VAD-Lexicon-Aug2018Release.zip", destfile="NRCVAD.zip")
#unzip("NRCVAD.zip")

To use the lexicon, we have to read the valence file in English, name its columns, and convert it into a tibble:

Valencedf <- read.table("NRC-VAD-Lexicon-Aug2018Release/OneFilePerDimension/v-scores.txt", header=F, sep="\t")
names(Valencedf) <- c("word","valence")
vdf <- tibble(Valencedf)

Then in the same way as when we used AFINN, we can calculate the mean valence of words said by each character in The Office:

wordsdf %>% 
  inner_join(vdf) %>%
  group_by(character) %>%
  summarize(meanvalence=mean(valence), n=n()) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  arrange(desc(meanvalence)) -> nrcdf
## Joining with `by = join_by(word)`
nrcdf
## # A tibble: 20 × 3
##    character meanvalence     n
##    <chr>           <dbl> <int>
##  1 Pam             0.639 13646
##  2 Phyllis         0.636  2287
##  3 Ryan            0.633  3722
##  4 Holly           0.632  1461
##  5 Jim             0.631 17598
##  6 Robert          0.628  1901
##  7 Jan             0.627  2388
##  8 Erin            0.626  3948
##  9 Michael         0.626 46421
## 10 Darryl          0.622  3699
## 11 Toby            0.622  2369
## 12 Andy            0.620 14237
## 13 Kelly           0.620  2832
## 14 Kevin           0.618  3919
## 15 Angela          0.618  4311
## 16 Stanley         0.615  1890
## 17 Oscar           0.611  3706
## 18 Gabe            0.610  1840
## 19 Nellie          0.609  2241
## 20 Dwight          0.600 25246

You can see that the ranking is somewhat similar. We can compare the mean sentiment per character in AFINN and the NRC-VAD lexicon with a scatterplot:

joindf <- inner_join(nrcdf, afinndf, by="character")
plot(joindf$meanvalence, joindf$sent, type="n", xlab="NRC Valence", ylab="AFINN score")
text(joindf$meanvalence, joindf$sent, joindf$character)

As you see, there is some correlation but still considerable disagreement for some characters, especially those with the most negative mean sentiment. The correlation coefficient can be calculated like this:

cor.test(joindf$meanvalence, joindf$sent)
## 
##  Pearson's product-moment correlation
## 
## data:  joindf$meanvalence and joindf$sent
## t = 4.9896, df = 18, p-value = 9.495e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4817028 0.9007010
## sample estimates:
##       cor 
## 0.7618294

A correlation of about 0.76 is not that strong given that these two measurements should be very similar. There is a lot of research comparing sentiment analysis methods like these; the most important thing is to choose one that can be validated for your application case, as you can learn in the validating sentiment analysis exercise.
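
As an additional check, since the two lexica are on different scales, you can also compare them in terms of ranks rather than raw values; a minimal sketch using the same joindf object:

# rank-based agreement between the two mean scores (Spearman correlation)
cor.test(joindf$meanvalence, joindf$sent, method="spearman")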

Sentiment analysis with Transformer models

Run this part in an interactive R console rather than inside the R Markdown file or while knitting.

You can use off-the-shelf state-of-the-art sentiment analysis models based on the Transformer architecture (the same architecture behind ChatGPT). You first need to install the reticulate package, which lets you run Python code from within R, and then use it to install miniconda, a minimal Python distribution. You only need to run install_miniconda() once and it will stay set up.

install.packages("reticulate")
library(reticulate)
install_miniconda()

Once you have that set up, you can install the latest version of the huggingfaceR package, which lets you use models stored in the HuggingFace repository. To complete this link to HuggingFace, you need to run hf_python_depends() once to download the required Python dependencies, so that you can then focus on your R code.

devtools::install_github("farach/huggingfaceR")
library(huggingfaceR)
hf_python_depends() 

Now you can download and run models from HuggingFace. For example, you can use a DistilBERT model fine-tuned for sentiment classification:

library(huggingfaceR)

distilBERT <- hf_load_pipeline(
    model_id = "distilbert-base-uncased-finetuned-sst-2-english", 
    task = "text-classification"
    )

distilBERT("I like you. I love you")

Or LEIA (Linguistic Embeddings for the Identification of Affect) for emotion identification:

LEIA <- hf_load_pipeline(
    model_id = "LEIA/LEIA-base", 
    task = "text-classification"
    )

LEIA("I am so angry right now.")

If you want to learn more about LEIA, check our recent paper about it.