The R community has developed many packages to run off-the-shelf unsupervised sentiment analysis methods, also called dictionary methods. Once a method is published for another language (e.g. Python), it is usually just a matter of time until an R developer ports it to R. Furthermore, there are useful packages to access dictionaries and other resources directly from R. In this tutorial, you will see some examples of how to use these resources. You can find the sources to run this yourself in the associated GitHub repository.

VADER

As we saw in the unsupervised sentiment analysis topic, VADER is one of the best unsupervised methods to analyze social media text, in particular Twitter. A recently developed R package makes it very easy to run VADER from R. To install and load the package, you just need to do as with any other R package:

install.packages("vader")
library(vader)

The vader package includes a function called get_vader() that will run the VADER method over a text you give it:

get_vader("This book is horrible, but I love it.")
##                      word_scores                         compound 
## "{0, 0, 0, -1.25, 0, 0, 4.8, 0}"                          "0.676" 
##                              pos                              neu 
##                          "0.413"                          "0.427" 
##                              neg                        but_count 
##                           "0.16"                              "1"

The result is an R vector with named entries:
- word_scores: a string that contains an ordered list with the matched scores for each of the words in the text. In the example you can see the negative score for “horrible” and the positive score for “love”.
- compound: the final valence compound of VADER for the whole text after applying modifiers and aggregation rules.
- pos, neg, and neu: the parts of the compound for positive, negative, and neutral content. These take into account modifiers and are combined when calculating the compound score.
- but_count: an additional count of “but”, since it can complicate the calculation of sentiment.

A named vector can be accessed with the bracket operator:

vaderres <- get_vader("This book is horrible, but I love it.")
vaderres["compound"]
## compound 
##  "0.676"
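Note that every entry in the result is a character string, so it has to be converted before doing arithmetic. The following is a minimal sketch in base R; it uses a hand-built stand-in vector with the strings copied from the output above, so it runs without the vader package. The names `parts` and `word_scores` are only illustrative.

```r
# Stand-in for the get_vader() result above; the strings are copied from that output
vaderres <- c(word_scores = "{0, 0, 0, -1.25, 0, 0, 4.8, 0}",
              compound = "0.676", pos = "0.413", neu = "0.427",
              neg = "0.16", but_count = "1")

# Convert the compound score from a string to a number
compound <- as.numeric(vaderres[["compound"]])

# as.numeric() drops the names, so restore them for the three proportions
parts <- setNames(as.numeric(vaderres[c("pos", "neu", "neg")]),
                  c("pos", "neu", "neg"))

# Strip the braces and split on commas to recover the per-word scores
word_scores <- as.numeric(
  strsplit(gsub("[{}]", "", vaderres[["word_scores"]]), ",")[[1]]
)

compound
parts
word_scores
```

The same conversion should apply to a real get_vader() result, assuming it has the structure printed above.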

VADER takes some punctuation into account. For example, an exclamation mark makes the positive sentiment more intense here:

get_vader("This book is horrible, but I love it!")
##                      word_scores                         compound 
## "{0, 0, 0, -1.25, 0, 0, 4.8, 0}"                          "0.704" 
##                              pos                              neu 
##                          "0.425"                          "0.418" 
##                              neg                        but_count 
##                          "0.157"                              "1"

Words can also modify the meaning of other words. For example, amplifiers make the sentiment of the following word stronger:

get_vader("This book is bad")
##       word_scores          compound               pos               neu 
## "{0, 0, 0, -2.5}"          "-0.542"               "0"           "0.462" 
##               neg         but_count 
##           "0.538"               "0"
get_vader("This book is very bad")
##            word_scores               compound                    pos 
## "{0, 0, 0, 0, -2.793}"               "-0.585"                    "0" 
##                    neu                    neg              but_count 
##                "0.513"                "0.487"                    "0"

Modifiers can make sentiment weaker too, for example the word “slightly” in this case:

get_vader("This book is bad")
##       word_scores          compound               pos               neu 
## "{0, 0, 0, -2.5}"          "-0.542"               "0"           "0.462" 
##               neg         but_count 
##           "0.538"               "0"
get_vader("This book is slightly bad")
##            word_scores               compound                    pos 
## "{0, 0, 0, 0, -2.207}"               "-0.495"                    "0" 
##                    neu                    neg              but_count 
##                "0.555"                "0.445"                    "0"

And negators reverse the valence of a word and weaken it a bit, based on empirical observations reported in the original paper:

get_vader("This book is not bad")
##          word_scores             compound                  pos 
## "{0, 0, 0, 0, 1.85}"              "0.431"              "0.416" 
##                  neu                  neg            but_count 
##              "0.584"                  "0"                  "0"
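The word scores in the four examples above can be reproduced with a toy calculation. This is only a sketch to build intuition, not the implementation of the vader package: the function `toy_word_score()` is hypothetical, it looks only at the word directly before the sentiment word, and the constants (0.293 and -0.74) are inferred from the outputs shown above.

```r
# Toy illustration only: NOT how the vader package is implemented
lexicon <- c(bad = -2.5)                          # valence copied from the output above
boosters <- c(very = 0.293, slightly = -0.293)    # increments inferred from the outputs
negation_scalar <- -0.74                          # inferred: -2.5 * -0.74 = 1.85

toy_word_score <- function(modifier, word) {
  valence <- lexicon[[word]]
  if (modifier %in% names(boosters)) {
    # Amplifiers push the score away from zero, dampeners pull it toward zero
    valence <- valence + sign(valence) * boosters[[modifier]]
  } else if (modifier == "not") {
    # Negators flip the sign and shrink the magnitude
    valence <- valence * negation_scalar
  }
  valence
}

toy_word_score("is", "bad")        # -2.5
toy_word_score("very", "bad")      # -2.793
toy_word_score("slightly", "bad")  # -2.207
toy_word_score("not", "bad")       # 1.85
```

The real method applies further rules (for example for punctuation, capitalization, and "but") before it aggregates word scores into the compound value.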

The vader package also includes a vader_df() function to run VADER over a series of texts and produce results in a data frame. Here we use the schrute package to load the scripts of the US TV series “The Office” and run VADER over the first six lines of the series:

library(schrute)
texts <- head(theoffice$text)
vader_df(texts)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                      text
## 1                                                                                                                                                                                                                                                                                                                                                                          All right Jim. Your quarterlies look very good. How are things at the library?
## 2                                                                                                                                                                                                                                                                                                                                                                                                              Oh, I told you. I couldn't close it. So...
## 3                                                                                                                                                                                                                                                                                                                                                                     So you've come to the master for guidance? Is this what you're saying, grasshopper?
## 4                                                                                                                                                                                                                                                                                                                                                                                                              Actually, you called me in here, but yeah.
## 5                                                                                                                                                                                                                                                                                                                                                                                                         All right. Well, let me show you how it's done.
## 6 Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... So that's the way it's done.
##                                                                                                                                                                                                                                                                       word_scores
## 1                                                                                                                                                                                                                                  {0, 0, 0, 0, 0, 0, 0, 2.193, 0, 0, 0, 0, 0, 0}
## 2                                                                                                                                                                                                                                                     {0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3                                                                                                                                                                                                                                      {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 4                                                                                                                                                                                                                                                      {0, 0, 0, 0, 0, 0, 0, 1.8}
## 5                                                                                                                                                                                                                                                {0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0}
## 6 {1.7, 0, 1.5, 0, 0, 0, 0, 0, 0, 1.3, 1.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.3, 2.133, 0, -0.3, 0, -1.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.393, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
##   compound   pos   neu   neg but_count
## 1    0.493 0.197 0.803 0.000         0
## 2    0.000 0.000 1.000 0.000         0
## 3    0.000 0.000 1.000 0.000         0
## 4    0.421 0.286 0.714 0.000         1
## 5    0.273 0.189 0.811 0.000         0
## 6    0.857 0.168 0.754 0.078         0

Due to the modification rules, VADER is not as fast as other dictionary-based methods and running it for the whole theoffice dataset can take from minutes to hours depending on your computer. However, the vader_df function makes it much faster than a loop.

Syuzhet

Syuzhet is a dictionary developed by [Matthew Jockers](https://www.matthewjockers.net/2015/02/02/syuzhet/) to analyze the arcs of sentiment in novels. The R package syuzhet makes it very easy to use and includes other sentiment dictionaries too:

install.packages("syuzhet")
library(syuzhet)

The syuzhet package provides the get_sentiment() function, which is called in the same way as get_vader() but uses the syuzhet lexicon and returns a single number per text, obtained by adding up the scores of the matched words:

get_sentiment("This book is horrible")
## [1] -0.75
get_sentiment("This book is horrible, but I love it")
## [1] 0
get_sentiment("This book is horrible, but I love it!")
## [1] 0

As you see in the examples above, the first sentence is scored as negative, but the coexistence of a negative and a positive word in the second example makes them cancel out and give a zero. There are no modification rules, so having an exclamation mark makes no difference.
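A rule-free dictionary score of this kind can be sketched in a few lines of base R. This is a simplified illustration, not the code of the syuzhet package: `toy_lexicon` and `toy_sentiment()` are hypothetical names, and the two word values are inferred from the outputs above (-0.75 for "horrible", and 0.75 for "love" so that the two cancel out).

```r
# Hypothetical two-word lexicon with values inferred from the outputs above
toy_lexicon <- c(horrible = -0.75, love = 0.75)

toy_sentiment <- function(text) {
  # Lowercase, then split on anything that is not a letter or an apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Words missing from the lexicon index to NA and are ignored in the sum
  sum(toy_lexicon[words], na.rm = TRUE)
}

toy_sentiment("This book is horrible")                  # -0.75
toy_sentiment("This book is horrible, but I love it")   # 0
toy_sentiment("This book is horrible, but I love it!")  # 0
```

Because punctuation is discarded during tokenization and there are no modifier rules, the exclamation mark cannot change the result.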

The get_sentiment() function can be run for a vector with more than one text, making it simple to get many scores:

get_sentiment(head(theoffice$text))
## [1] 2.15 0.00 0.50 0.00 1.60 4.60

syuzhet also includes other dictionaries, like the popular AFINN dictionary. In AFINN, words are scored on the same scale as SentiStrength, as integers from -5 to +5, and thus the output looks a bit different:

get_sentiment(head(theoffice$text), method="afinn")
## [1] 3 0 0 1 0 3

When combined with dplyr data wrangling tools, packages like syuzhet make it very easy to compute sentiment aggregates. Here we run Syuzhet over all the lines in The Office, since it’s much faster than VADER. We make a ranking by mean sentiment over the main characters of the series:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
theoffice$sentiment <- get_sentiment(theoffice$text)
theoffice %>% 
  group_by(character) %>% 
  summarise(sent=mean(sentiment), n=n()) %>% 
  arrange(desc(n)) %>% head(n=20) %>% 
  arrange(desc(sent))
## # A tibble: 20 x 3
##    character  sent     n
##    <chr>     <dbl> <int>
##  1 Michael   0.335 10921
##  2 Andy      0.280  3754
##  3 Nellie    0.260   529
##  4 Jim       0.249  6303
##  5 Gabe      0.237   427
##  6 Jan       0.232   810
##  7 Pam       0.231  5031
##  8 Ryan      0.222  1198
##  9 Darryl    0.217  1182
## 10 Holly     0.208   555
## 11 Toby      0.200   818
## 12 Erin      0.199  1440
## 13 Angela    0.181  1569
## 14 Phyllis   0.179   970
## 15 Kevin     0.176  1564
## 16 Oscar     0.174  1368
## 17 Dwight    0.171  6847
## 18 Kelly     0.161   841
## 19 Stanley   0.120   678
## 20 Meredith  0.103   559

Tidytext

tidytext is a very useful package that follows the same philosophy as dplyr but for text. It is very powerful for text analysis, including sentiment analysis.

install.packages("tidytext")
library(tidytext)

When you run some tidytext functions, they might ask you to install additional packages like “textdata” or to download resources like the sentiment dictionaries. The plain installation keeps these additional resources to a minimum, and you will be asked to install them only once, the first time you use them.

Tidytext, like other dplyr-related packages, uses tibbles, which are an upgraded version of data frames. Converting plain text to a tibble is rather simple. In this example we create a tibble with the text of the lines of the theoffice dataset and add a column with the name of the character that says the line:

texts <- theoffice$text
head(texts)
## [1] "All right Jim. Your quarterlies look very good. How are things at the library?"                                                                                                                                                                                                                                                                                                                                                                         
## [2] "Oh, I told you. I couldn't close it. So..."                                                                                                                                                                                                                                                                                                                                                                                                             
## [3] "So you've come to the master for guidance? Is this what you're saying, grasshopper?"                                                                                                                                                                                                                                                                                                                                                                    
## [4] "Actually, you called me in here, but yeah."                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "All right. Well, let me show you how it's done."                                                                                                                                                                                                                                                                                                                                                                                                        
## [6] "Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... So that's the way it's done."
textdf <- tibble(character=theoffice$character, text=texts)
head(textdf)
## # A tibble: 6 x 2
##   character text                                                                
##   <chr>     <chr>                                                               
## 1 Michael   All right Jim. Your quarterlies look very good. How are things at t…
## 2 Jim       Oh, I told you. I couldn't close it. So...                          
## 3 Michael   So you've come to the master for guidance? Is this what you're sayi…
## 4 Jim       Actually, you called me in here, but yeah.                          
## 5 Michael   All right. Well, let me show you how it's done.                     
## 6 Michael   Yes, I'd like to speak to your office manager, please. Yes, hello. …

Once you have your text in a tibble, you can tokenize the texts (i.e. separate them into words) with the unnest_tokens() function. This function creates a column with the name of its first parameter and fills it with the tokens of the text column specified in its second parameter:

textdf %>% 
  unnest_tokens(word, text) -> wordsdf
head(wordsdf)
## # A tibble: 6 x 2
##   character word       
##   <chr>     <chr>      
## 1 Michael   all        
## 2 Michael   right      
## 3 Michael   jim        
## 4 Michael   your       
## 5 Michael   quarterlies
## 6 Michael   look
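As the output shows, tokenization lowercases the words, drops punctuation, and produces one row per word while repeating the other columns. A rough base R approximation of that behaviour, on a two-line toy data frame, could look like the sketch below. The names `textdf_toy` and `wordsdf_toy` are illustrative, and the simple regular expression is an assumption that will not match the tokenizer used by tidytext in every case.

```r
# Two lines taken from the texts shown above
textdf_toy <- data.frame(
  character = c("Michael", "Jim"),
  text = c("All right Jim. Your quarterlies look very good.", "Oh, I told you."),
  stringsAsFactors = FALSE
)

# Lowercase, then split each text on runs of non-word characters
tokens <- strsplit(tolower(textdf_toy$text), "[^a-z0-9']+")

# One row per token, repeating the speaker once for each token of their line
wordsdf_toy <- data.frame(
  character = rep(textdf_toy$character, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop empty tokens that can appear when a text starts with punctuation
wordsdf_toy <- wordsdf_toy[nzchar(wordsdf_toy$word), ]
head(wordsdf_toy)
```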

You can combine it with the count() function of dplyr to make a table of word frequencies in the whole theoffice corpus:

wordsdf %>% 
  count(word) %>% 
  arrange(desc(n))
## # A tibble: 19,617 x 2
##    word      n
##    <chr> <int>
##  1 i     20757
##  2 you   19940
##  3 the   14489
##  4 to    13507
##  5 a     12922
##  6 and    8823
##  7 it     8365
##  8 that   7776
##  9 is     7408
## 10 of     6324
## # … with 19,607 more rows

Tidytext includes some useful datasets. For example, the stop_words dataset lists English stop words, which carry very little meaning on their own. You can combine it with the anti_join() function to remove them from the tokens tibble:

data(stop_words)
wordsdf %>% 
  anti_join(stop_words)
## Joining, by = "word"
## # A tibble: 169,835 x 2
##    character word       
##    <chr>     <chr>      
##  1 Michael   jim        
##  2 Michael   quarterlies
##  3 Michael   library    
##  4 Jim       told       
##  5 Jim       close      
##  6 Michael   master     
##  7 Michael   guidance   
##  8 Michael   grasshopper
##  9 Jim       called     
## 10 Jim       yeah       
## # … with 169,825 more rows
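Conceptually, an anti-join keeps the rows of the first table that have no match in the second one. The sketch below shows the same idea in base R on a handful of tokens; `toy_stop_words` is a hypothetical mini list and not the stop_words dataset of tidytext.

```r
# First six tokens of the corpus, as shown earlier
words <- c("all", "right", "jim", "your", "quarterlies", "look")

# Hypothetical mini stop word list, for illustration only
toy_stop_words <- c("all", "right", "your", "look")

# Keep only the tokens that do NOT appear in the stop word list
kept <- words[!words %in% toy_stop_words]
kept  # "jim" "quarterlies"
```

With dplyr, the join column can also be named explicitly, as in anti_join(stop_words, by = "word"), which avoids the "Joining, by" message.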

You can see the effect of removing stop words in the word frequency ranking we calculated earlier:

wordsdf %>% 
  anti_join(stop_words)  %>% 
  count(word) %>% 
  arrange(desc(n))
## Joining, by = "word"
## # A tibble: 18,946 x 2
##    word        n
##    <chr>   <int>
##  1 yeah     2930
##  2 hey      2232
##  3 michael  1859
##  4 uh       1459
##  5 gonna    1399
##  6 dwight   1340
##  7 jim      1168
##  8 time     1147
##  9 pam      1044
## 10 guys      945
## # … with 18,936 more rows

You can do this for any word list: you can specify your own and count the occurrences of its words. For example, you can calculate the normalized frequency of the words “uh”, “hey”, and “um” across characters that say more than 5000 words:

wordlist <- data.frame(word=c("uh","hey","um"))
wordsdf %>% 
  inner_join(wordlist)  %>% 
  group_by(character) %>%
  summarize(nuh=n()) -> uhcount
## Joining, by = "word"
wordsdf %>% 
  group_by(character) %>%
  summarize(n=n()) -> charcount

inner_join(uhcount, charcount) %>% 
  filter(n > 5000) %>% 
  mutate(ratioUh = nuh/n) %>% 
  arrange(desc(ratioUh))
## Joining, by = "character"
## # A tibble: 19 x 4
##    character   nuh      n ratioUh
##    <chr>     <int>  <int>   <dbl>
##  1 Toby        156   7808 0.0200 
##  2 Pam         539  44929 0.0120 
##  3 Jim         670  57527 0.0116 
##  4 Ryan        128  11714 0.0109 
##  5 Gabe         60   5612 0.0107 
##  6 Darryl       99  11095 0.00892
##  7 Jan          63   7818 0.00806
##  8 Erin         98  12801 0.00766
##  9 Michael    1032 146603 0.00704
## 10 Andy        294  43792 0.00671
## 11 Oscar        74  11808 0.00627
## 12 Kevin        68  12418 0.00548
## 13 Phyllis      36   7461 0.00483
## 14 Dwight      355  74046 0.00479
## 15 Angela       55  13408 0.00410
## 16 Kelly        34   9035 0.00376
## 17 Nellie       21   6827 0.00308
## 18 Stanley      16   5665 0.00282
## 19 Robert        7   5661 0.00124

You can learn more about tidytext at https://www.tidytextmining.com/

Using your own lexicon

You can run any dictionary method yourself through tidytext as long as you have a file with the words and their scores or sentiment classifications. tidytext gives access to three sentiment lexica with three different kinds of annotations. You can get the annotations of each word with the get_sentiments() function. When running the lines below, you will have to say “yes” to download the data files:

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows

As you see, AFINN has a numerical score per word, bing has words in classes of positivity and negativity, and the NRC lexicon maps words to emotions (these are emotions from Plutchik’s wheel; you can learn more about them in the measuring emotions topic).

You can easily match words in texts to these lexica with tidytext. The code below takes the tokens data frame, matches words with the corresponding scores in the AFINN lexicon, and calculates a ranking of main characters in The Office by the mean sentiment score of the words they say.

wordsdf %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(character) %>%
  summarize(sent=mean(value), n=n()) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  arrange(desc(sent)) -> afinndf
## Joining, by = "word"
afinndf
## # A tibble: 20 x 3
##    character  sent     n
##    <chr>     <dbl> <int>
##  1 Pam       0.801  3943
##  2 Jim       0.773  4674
##  3 Ryan      0.748   901
##  4 Holly     0.703   438
##  5 Erin      0.699  1128
##  6 Phyllis   0.698   582
##  7 Robert    0.674   460
##  8 Andy      0.643  3609
##  9 Gabe      0.621   443
## 10 Michael   0.609 12567
## 11 Jan       0.589   613
## 12 Kevin     0.583  1121
## 13 Darryl    0.575   954
## 14 Toby      0.548   564
## 15 Oscar     0.533   885
## 16 Kelly     0.453   793
## 17 Dwight    0.366  6068
## 18 Nellie    0.330   573
## 19 Angela    0.303  1175
## 20 Stanley   0.260   415

As you see, even the most negative character (Stanley) still has a positive mean sentiment score. This is called the positivity bias of language and you can see more examples of it in the Twitter sentiment exercise.
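Keep in mind that an inner join drops every word that is not in the lexicon, so the n column in the ranking above counts matched words only, not all words a character says. The base R sketch below illustrates this with merge() on toy data. The data frames are hypothetical and the three word scores are illustrative values on the AFINN scale.

```r
# Toy tokens table: five words spoken by two characters
toy_words <- data.frame(
  character = c("Pam", "Pam", "Pam", "Stanley", "Stanley"),
  word = c("love", "the", "great", "bad", "paper"),
  stringsAsFactors = FALSE
)

# Toy lexicon with illustrative scores on the -5 to +5 scale
toy_lexicon <- data.frame(
  word = c("love", "great", "bad"),
  value = c(3, 3, -3),
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default: "the" and "paper" are dropped
matched <- merge(toy_words, toy_lexicon, by = "word")

# Share of tokens that the lexicon covers
coverage <- nrow(matched) / nrow(toy_words)

# Mean score per character over matched words only
means <- tapply(matched$value, matched$character, mean)
means
```

Reporting the coverage next to the mean is a simple way to see how much of the text the score is actually based on.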

You can do the same as above but based on a file. Saif Mohammad developed the NRC lexica and one of the latest is the NRC Valence, Arousal, and Dominance lexicon (NRC-VAD). The following code downloads the NRC-VAD files and unzips them in your local folder.

download.file("https://saifmohammad.com/WebDocs/VAD/NRC-VAD-Lexicon-Aug2018Release.zip", destfile="NRCVAD.zip")
unzip("NRCVAD.zip")

To use the lexicon, we have to read the valence file in English, name its columns, and convert it into a tibble:

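# Read the English valence file: tab-separated, no header row, words kept as strings
Valencedf <- read.table("NRC-VAD-Lexicon-Aug2018Release/OneFilePerDimension/v-scores.txt", header=FALSE, sep="\t", stringsAsFactors=FALSE)
names(Valencedf) <- c("word","valence")
vdf <- tibble(Valencedf)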
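If you load a lexicon from another file, it is worth checking how read.table() handles quotes, since by default an apostrophe inside a word is treated as a quoting character. The sketch below reads a small tab-separated lexicon from a string instead of a file so that it is self-contained. The three rows are made up for illustration and are not taken from NRC-VAD.

```r
# Made-up rows in "word<TAB>score" format, standing in for a lexicon file
raw_lines <- "aardvark\t0.427\nwon't\t0.250\nzest\t0.875"

toy_lex <- read.table(
  text = raw_lines,               # use file = "path/to/lexicon.txt" for a real file
  header = FALSE,
  sep = "\t",
  quote = "",                     # do not treat apostrophes as quote characters
  stringsAsFactors = FALSE,
  col.names = c("word", "valence")
)

str(toy_lex)
```

After reading, checking the number of rows and the column types with str() is a quick way to catch parsing problems before joining.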

Then in the same way as when we used AFINN, we can calculate the mean valence of words said by each character in The Office:

wordsdf %>% 
  inner_join(vdf) %>%
  group_by(character) %>%
  summarize(meanvalence=mean(valence), n=n()) %>%
  arrange(desc(n)) %>%
  head(20) %>%
  arrange(desc(meanvalence)) -> nrcdf
## Joining, by = "word"
nrcdf
## # A tibble: 20 x 3
##    character meanvalence     n
##    <chr>           <dbl> <int>
##  1 Pam             0.639 13635
##  2 Phyllis         0.636  2287
##  3 Ryan            0.633  3695
##  4 Holly           0.632  1461
##  5 Jim             0.631 17592
##  6 Robert          0.628  1900
##  7 Jan             0.627  2375
##  8 Erin            0.626  3943
##  9 Michael         0.626 46271
## 10 Darryl          0.622  3698
## 11 Toby            0.622  2369
## 12 Andy            0.620 14235
## 13 Kelly           0.619  2821
## 14 Kevin           0.618  3920
## 15 Angela          0.618  4310
## 16 Stanley         0.615  1890
## 17 Oscar           0.611  3702
## 18 Gabe            0.610  1840
## 19 Nellie          0.609  2237
## 20 Dwight          0.600 25126

You can see that the ranking is somewhat similar. We can compare the mean sentiment per character in AFINN and the NRC-VAD lexicon with a scatterplot:

joindf <- inner_join(nrcdf, afinndf, by="character")
plot(joindf$meanvalence, joindf$sent, type="n", xlab="NRC Valence", ylab="AFINN score")
text(joindf$meanvalence, joindf$sent, joindf$character)

As you see, there is some correlation but still considerable disagreement for some characters, especially for those with the lowest mean sentiment. The correlation coefficient can be calculated like this:

cor.test(joindf$meanvalence, joindf$sent)
## 
##  Pearson's product-moment correlation
## 
## data:  joindf$meanvalence and joindf$sent
## t = 5.0146, df = 18, p-value = 8.996e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4846177 0.9014163
## sample estimates:
##       cor 
## 0.7634204

A correlation of about 0.76 is not so strong given that these two measurements should be very similar. A lot of ongoing research compares sentiment analysis methods like these. The most important thing is to choose one that can be validated for your application case, as you can learn in the validating sentiment analysis exercise.
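Since both lexica were used to rank characters, a rank correlation can complement Pearson's coefficient. The sketch below recomputes the comparison from the rounded means printed in the two tables above, so it runs without the lexicon files. Because of the rounding, the Pearson value may differ slightly from the cor.test() output. The data frame name `sent_df` is illustrative.

```r
# Rounded per-character means copied from the AFINN and NRC-VAD tables above
sent_df <- data.frame(
  character = c("Pam", "Jim", "Ryan", "Holly", "Erin", "Phyllis", "Robert",
                "Andy", "Gabe", "Michael", "Jan", "Kevin", "Darryl", "Toby",
                "Oscar", "Kelly", "Dwight", "Nellie", "Angela", "Stanley"),
  afinn = c(0.801, 0.773, 0.748, 0.703, 0.699, 0.698, 0.674,
            0.643, 0.621, 0.609, 0.589, 0.583, 0.575, 0.548,
            0.533, 0.453, 0.366, 0.330, 0.303, 0.260),
  nrc = c(0.639, 0.631, 0.633, 0.632, 0.626, 0.636, 0.628,
          0.620, 0.610, 0.626, 0.627, 0.618, 0.622, 0.622,
          0.611, 0.619, 0.600, 0.609, 0.618, 0.615),
  stringsAsFactors = FALSE
)

# Linear association between the two means
pearson <- cor(sent_df$nrc, sent_df$afinn)

# Agreement between the two rankings (tied values get average ranks)
spearman <- cor(sent_df$nrc, sent_df$afinn, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

The Spearman coefficient depends only on the order of the characters, which makes it less sensitive to the few characters with very low AFINN means.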