The R community has developed many packages to run off-the-shelf unsupervised sentiment analysis methods, also called dictionary methods. Once a method is published for another language (e.g. Python), it is usually just a matter of time before an R developer releases an R version of the package. Furthermore, there are useful packages to access dictionaries and other resources directly from R. In this tutorial, you will see some examples of how to use these resources. You can find the sources to run this yourself in the associated GitHub repository.

As we saw in the unsupervised sentiment analysis topic, VADER is one of the best unsupervised methods to analyze social media text, in particular Twitter. A recently developed R package makes it very easy to run VADER from R. To install and load the package, you proceed as with any other R package:

install.packages("vader")
library(vader)

The vader package includes a function called get_vader() that will run the VADER method over a text you give it:

get_vader("This book is horrible, but I love it.")
##                      word_scores                         compound
## "{0, 0, 0, -1.25, 0, 0, 4.8, 0}"                          "0.676"
##                              pos                              neu
##                          "0.413"                          "0.427"
##                              neg                        but_count
##                           "0.16"                              "1"

The result is an R vector with named entries:
- word_scores: a string containing an ordered list with the matched scores for each word in the text. In the example you can see the negative score for “horrible” and the positive score for “love”.
- compound: the final valence compound of VADER for the whole text after applying modifiers and aggregation rules.
- pos, neg, and neu: the parts of the compound for positive, negative, and neutral content. These take modifiers into account and are combined when calculating the compound score.
- but_count: an additional count of “but”, since it can complicate the calculation of sentiment.

A named vector can be accessed with the bracket operator:

vaderres <- get_vader("This book is horrible, but I love it.")
vaderres["compound"]
## compound
##  "0.676"
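Note that all entries, including the scores, are returned as strings. The following is a small sketch (not part of the vader package itself) of how you could convert them into numbers, including one possible way to parse the word_scores entry:

# the entries are strings, so convert before doing arithmetic
as.numeric(vaderres["compound"])

# parse the word_scores string into a numeric vector, one score per word
scores_txt <- gsub("[{}]", "", vaderres["word_scores"])
as.numeric(strsplit(scores_txt, ",")[[1]])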

VADER takes into account some punctuation marks; for example, here an exclamation mark makes a positive word more intense:

get_vader("This book is horrible, but I love it!")
##                      word_scores                         compound
## "{0, 0, 0, -1.25, 0, 0, 4.8, 0}"                          "0.704"
##                              pos                              neu
##                          "0.425"                          "0.418"
##                              neg                        but_count
##                          "0.157"                              "1"

Words can also modify the meaning of other words; for example, amplifiers like “very” make a word stronger:

get_vader("This book is bad")
##       word_scores          compound               pos               neu
## "{0, 0, 0, -2.5}"          "-0.542"               "0"           "0.462"
##               neg         but_count
##           "0.538"               "0"
get_vader("This book is very bad")
##            word_scores               compound                    pos
## "{0, 0, 0, 0, -2.793}"               "-0.585"                    "0"
##                    neu                    neg              but_count
##                "0.513"                "0.487"                    "0"

Modifiers can make sentiment weaker too, for example the word “slightly” in this case:

get_vader("This book is bad")
##       word_scores          compound               pos               neu
## "{0, 0, 0, -2.5}"          "-0.542"               "0"           "0.462"
##               neg         but_count
##           "0.538"               "0"
get_vader("This book is slightly bad")
##            word_scores               compound                    pos
## "{0, 0, 0, 0, -2.207}"               "-0.495"                    "0"
##                    neu                    neg              but_count
##                "0.555"                "0.445"                    "0"

And negators reverse the valence of a word and weaken it a bit, based on empirical observations reported in the original paper:

get_vader("This book is not bad")
##          word_scores             compound                  pos
## "{0, 0, 0, 0, 1.85}"              "0.431"              "0.416"
##                  neu                  neg            but_count
##              "0.584"                  "0"                  "0"
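A quick way to see how these modifier and negator rules interact is to score several variants of the same sentence and collect only the compound values. This is a small sketch; the exact numbers depend on the lexicon version installed with your package:

variants <- c("This book is bad",
              "This book is not bad",
              "This book is not very bad")
# apply get_vader() to each variant and keep only the compound score
sapply(variants, function(x) as.numeric(get_vader(x)["compound"]))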

The vader package also includes a vader_df() function to run VADER over a series of texts and produce results in a data frame. Here we use the schrute library to load the scripts of the US TV series “The Office” and run VADER over the first six lines of the series:

library(schrute)
texts <- head(theoffice$text)
vader_df(texts)
##                                                                                   text
## 1      All right Jim. Your quarterlies look very good. How are things at the library?
## 2                                           Oh, I told you. I couldn't close it. So...
## 3 So you've come to the master for guidance? Is this what you're saying, grasshopper?
## 4                                           Actually, you called me in here, but yeah.
## 5                                      All right. Well, let me show you how it's done.
## 6 Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... So that's the way it's done.
##                                        word_scores
## 1   {0, 0, 0, 0, 0, 0, 0, 2.193, 0, 0, 0, 0, 0, 0}
## 2                      {0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3       {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 4                       {0, 0, 0, 0, 0, 0, 0, 1.8}
## 5                 {0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0}
## 6 {1.7, 0, 1.5, 0, 0, 0, 0, 0, 0, 1.3, 1.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.3, 2.133, 0, -0.3, 0, -1.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.393, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
##   compound   pos   neu   neg but_count
## 1    0.493 0.197 0.803 0.000         0
## 2    0.000 0.000 1.000 0.000         0
## 3    0.000 0.000 1.000 0.000         0
## 4    0.421 0.286 0.714 0.000         1
## 5    0.273 0.189 0.811 0.000         0
## 6    0.857 0.168 0.754 0.078         0

Due to the modification rules, VADER is not as fast as other dictionary-based methods, and running it over the whole theoffice dataset can take from minutes to hours depending on your computer. However, the vader_df() function makes it much faster than a loop.

### Syuzhet

Syuzhet is a dictionary developed by [Matthew Jockers](https://www.matthewjockers.net/2015/02/02/syuzhet/) to analyze the arcs of sentiment in novels. The R package syuzhet makes it very easy to use and includes other sentiment dictionaries too:

install.packages("syuzhet")
library(syuzhet)

The syuzhet package provides the get_sentiment() function, which works the same way as get_vader() but runs the syuzhet lexicon and computes a mean over the words of the text:

get_sentiment("This book is horrible")
## [1] -0.75
get_sentiment("This book is horrible, but I love it")
## [1] 0
get_sentiment("This book is horrible, but I love it!")
## [1] 0

As you can see in the examples above, the first sentence is scored as negative, but in the second example the coexistence of a negative and a positive word makes them cancel out and give a zero. There are no modification rules, so adding an exclamation mark makes no difference.

The get_sentiment() function can be run over a vector with more than one text, making it simple to get many scores:

get_sentiment(head(theoffice$text))
## [1] 2.15 0.00 0.50 0.00 1.60 4.60

syuzhet also includes other dictionaries, like the popular AFINN dictionary. In AFINN, words are scored on the same scale as SentiStrength, as integers from -5 to +5, so the output looks a bit different:

get_sentiment(head(theoffice$text), method="afinn")
## [1] 3 0 0 1 0 3

When combined with dplyr data wrangling tools, packages like syuzhet make it very easy to compute sentiment aggregates. Here we run Syuzhet over all the lines in The Office, since it is much faster than VADER, and build a ranking of the main characters of the series by mean sentiment:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
theoffice$sentiment <- get_sentiment(theoffice$text)
theoffice %>%
  group_by(character) %>%
  summarise(sent=mean(sentiment), n=n()) %>%
  arrange(desc(n)) %>%
  head(n=20) %>%
  arrange(desc(sent))
## # A tibble: 20 x 3
##    character  sent     n
##    <chr>     <dbl> <int>
##  1 Michael   0.335 10921
##  2 Andy      0.280  3754
##  3 Nellie    0.260   529
##  4 Jim       0.249  6303
##  5 Gabe      0.237   427
##  6 Jan       0.232   810
##  7 Pam       0.231  5031
##  8 Ryan      0.222  1198
##  9 Darryl    0.217  1182
## 10 Holly     0.208   555
## 11 Toby      0.200   818
## 12 Erin      0.199  1440
## 13 Angela    0.181  1569
## 14 Phyllis   0.179   970
## 15 Kevin     0.176  1564
## 16 Oscar     0.174  1368
## 17 Dwight    0.171  6847
## 18 Kelly     0.161   841
## 19 Stanley   0.120   678
## 20 Meredith  0.103   559

### Tidytext

tidytext is a very useful package that follows the same philosophy as dplyr but for text. It is very powerful for text analysis, including sentiment analysis.

install.packages("tidytext")
library(tidytext)

When you run some tidytext functions, you might be asked to install additional packages like "textdata" or to download resources like the sentiment dictionaries. The plain installation keeps these additional resources to a minimum, and you will be asked to install them only once, the first time you use them.

Tidytext, like other dplyr-related packages, uses tibbles, an enhanced version of data frames. Converting plain text to a tibble is rather simple. In this example we create a tibble with the text of the lines of the theoffice dataset and add a column with the name of the character that says each line:

texts <- theoffice$text
head(texts)
## [1] "All right Jim. Your quarterlies look very good. How are things at the library?"
## [2] "Oh, I told you. I couldn't close it. So..."
## [3] "So you've come to the master for guidance? Is this what you're saying, grasshopper?"
## [4] "Actually, you called me in here, but yeah."
## [5] "All right. Well, let me show you how it's done."
## [6] "Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... So that's the way it's done."
textdf <- tibble(character=theoffice$character, text=texts)
head(textdf)
## # A tibble: 6 x 2
##   character text                                                                 
##   <chr>     <chr>                                                                
## 1 Michael   All right Jim. Your quarterlies look very good. How are things at t…
## 2 Jim       Oh, I told you. I couldn't close it. So...                           
## 3 Michael   So you've come to the master for guidance? Is this what you're sayi…
## 4 Jim       Actually, you called me in here, but yeah.                           
## 5 Michael   All right. Well, let me show you how it's done.                      
## 6 Michael   Yes, I'd like to speak to your office manager, please. Yes, hello. …

Once you have your text in a tibble, you can tokenize the texts (i.e. separate them into words) with the unnest_tokens() function. This function creates a column named after its first parameter to hold the tokens of the text column given as its second parameter:

textdf %>% unnest_tokens(word, text) -> wordsdf
head(wordsdf)
## # A tibble: 6 x 2
##   character word       
##   <chr>     <chr>      
## 1 Michael   all        
## 2 Michael   right      
## 3 Michael   jim        
## 4 Michael   your       
## 5 Michael   quarterlies
## 6 Michael   look       

You can combine it with the count() function of dplyr to make a table of word frequencies over the whole theoffice corpus:

wordsdf %>% count(word) %>% arrange(desc(n))
## # A tibble: 19,617 x 2
##    word      n
##    <chr> <int>
##  1 i     20757
##  2 you   19940
##  3 the   14489
##  4 to    13507
##  5 a     12922
##  6 and    8823
##  7 it     8365
##  8 that   7776
##  9 is     7408
## 10 of     6324
## # … with 19,607 more rows

Tidytext includes some useful datasets; for example, the stop_words dataset contains English stop words that barely carry meaning. You can combine it with the anti_join() function to remove them from the tokens tibble:

data(stop_words)
wordsdf %>% anti_join(stop_words)
## Joining, by = "word"
## # A tibble: 169,835 x 2
##    character word       
##    <chr>     <chr>      
##  1 Michael   jim        
##  2 Michael   quarterlies
##  3 Michael   library    
##  4 Jim       told       
##  5 Jim       close      
##  6 Michael   master     
##  7 Michael   guidance   
##  8 Michael   grasshopper
##  9 Jim       called     
## 10 Jim       yeah       
## # … with 169,825 more rows

You can see the effect of removing stop words in the ranking we calculated earlier:

wordsdf %>% anti_join(stop_words) %>% count(word) %>% arrange(desc(n))
## Joining, by = "word"
## # A tibble: 18,946 x 2
##    word        n
##    <chr>   <int>
##  1 yeah     2930
##  2 hey      2232
##  3 michael  1859
##  4 uh       1459
##  5 gonna    1399
##  6 dwight   1340
##  7 jim      1168
##  8 time     1147
##  9 pam      1044
## 10 guys      945
## # … with 18,936 more rows

You can do this for any word list: specify your own list and count the occurrences of its words.
For example, you can calculate the normalized frequency of the words “uh”, “hey”, and “um” across characters that say more than 5000 words:

wordlist <- data.frame(word=c("uh","hey","um"))
wordsdf %>% inner_join(wordlist) %>% group_by(character) %>% summarize(nuh=n()) -> uhcount
## Joining, by = "word"
wordsdf %>% group_by(character) %>% summarize(n=n()) -> charcount
inner_join(uhcount, charcount) %>% filter(n>5000) %>% mutate(ratioUh = nuh/n) %>% arrange(desc(ratioUh))
## Joining, by = "character"
## # A tibble: 19 x 4
##    character   nuh      n ratioUh
##    <chr>     <int>  <int>   <dbl>
##  1 Toby        156   7808 0.0200 
##  2 Pam         539  44929 0.0120 
##  3 Jim         670  57527 0.0116 
##  4 Ryan        128  11714 0.0109 
##  5 Gabe         60   5612 0.0107 
##  6 Darryl       99  11095 0.00892
##  7 Jan          63   7818 0.00806
##  8 Erin         98  12801 0.00766
##  9 Michael    1032 146603 0.00704
## 10 Andy        294  43792 0.00671
## 11 Oscar        74  11808 0.00627
## 12 Kevin        68  12418 0.00548
## 13 Phyllis      36   7461 0.00483
## 14 Dwight      355  74046 0.00479
## 15 Angela       55  13408 0.00410
## 16 Kelly        34   9035 0.00376
## 17 Nellie       21   6827 0.00308
## 18 Stanley      16   5665 0.00282
## 19 Robert        7   5661 0.00124

You can learn more about tidytext at https://www.tidytextmining.com/

### Using your own lexicon

You can run any dictionary method yourself through tidytext as long as you have a file with the words and their scores or sentiment classifications. tidytext includes three sentiment lexica with three different kinds of annotations. You can get the annotations for each word with the get_sentiments() function. When running the lines below, you will have to say “yes” to download the data files:

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows

As you can see, AFINN has a numerical score per word, bing classifies words as positive or negative, and the NRC lexicon maps words to emotions (these are emotions from Plutchik’s wheel; you can learn more about it in the measuring emotions topic). You can easily match the words in texts to these lexica with tidytext. The code below takes the tokens data frame, matches words with the corresponding scores in the AFINN lexicon, and calculates a ranking of the main characters in The Office by the mean sentiment score of the words they say.
wordsdf %>% inner_join(get_sentiments("afinn")) %>% group_by(character) %>% summarize(sent=mean(value), n=n()) %>% arrange(desc(n)) %>% head(20) %>% arrange(desc(sent)) -> afinndf
## Joining, by = "word"
afinndf
## # A tibble: 20 x 3
##    character  sent     n
##    <chr>     <dbl> <int>
##  1 Pam       0.801  3943
##  2 Jim       0.773  4674
##  3 Ryan      0.748   901
##  4 Holly     0.703   438
##  5 Erin      0.699  1128
##  6 Phyllis   0.698   582
##  7 Robert    0.674   460
##  8 Andy      0.643  3609
##  9 Gabe      0.621   443
## 10 Michael   0.609 12567
## 11 Jan       0.589   613
## 12 Kevin     0.583  1121
## 13 Darryl    0.575   954
## 14 Toby      0.548   564
## 15 Oscar     0.533   885
## 16 Kelly     0.453   793
## 17 Dwight    0.366  6068
## 18 Nellie    0.330   573
## 19 Angela    0.303  1175
## 20 Stanley   0.260   415

As you can see, even the most negative character (Stanley) still has a positive mean sentiment score. This is called the positivity bias of language, and you can see more examples of it in the Twitter sentiment exercise.

You can do the same as above but based on a file. Saif Mohammad developed the NRC lexica, and one of the latest is the NRC Valence, Arousal, and Dominance lexicon (NRC-VAD). The following code downloads the NRC-VAD files and unzips them in your local folder:

download.file("https://saifmohammad.com/WebDocs/VAD/NRC-VAD-Lexicon-Aug2018Release.zip", destfile="NRCVAD.zip")
unzip("NRCVAD.zip")

To use the lexicon, we read the valence file in English, name its columns, and convert it into a tibble:

Valencedf <- read.table("NRC-VAD-Lexicon-Aug2018Release/OneFilePerDimension/v-scores.txt", header=F, sep="\t")
names(Valencedf) <- c("word","valence")
vdf <- tibble(Valencedf)

Then, in the same way as with AFINN, we can calculate the mean valence of the words said by each character in The Office:

wordsdf %>% inner_join(vdf) %>% group_by(character) %>% summarize(meanvalence=mean(valence), n=n()) %>% arrange(desc(n)) %>% head(20) %>% arrange(desc(meanvalence)) -> nrcdf
## Joining, by = "word"
nrcdf
## # A tibble: 20 x 3
##    character meanvalence     n
##    <chr>           <dbl> <int>
##  1 Pam             0.639 13635
##  2 Phyllis         0.636  2287
##  3 Ryan            0.633  3695
##  4 Holly           0.632  1461
##  5 Jim             0.631 17592
##  6 Robert          0.628  1900
##  7 Jan             0.627  2375
##  8 Erin            0.626  3943
##  9 Michael         0.626 46271
## 10 Darryl          0.622  3698
## 11 Toby            0.622  2369
## 12 Andy            0.620 14235
## 13 Kelly           0.619  2821
## 14 Kevin           0.618  3920
## 15 Angela          0.618  4310
## 16 Stanley         0.615  1890
## 17 Oscar           0.611  3702
## 18 Gabe            0.610  1840
## 19 Nellie          0.609  2237
## 20 Dwight          0.600 25126

You can see that the ranking is somewhat similar. We can compare the mean sentiment per character in AFINN and the NRC-VAD lexicon with a scatterplot:

joindf <- inner_join(nrcdf, afinndf, by="character")
plot(joindf$meanvalence, joindf$sent, type="n", xlab="NRC Valence", ylab="AFINN score")
text(joindf$meanvalence, joindf$sent, joindf$character)

As you can see, there is some correlation but still quite a lot of disagreement for some characters, especially those with the lowest mean sentiment. The correlation coefficient can be calculated like this:

cor.test(joindf$meanvalence, joindf$sent)
##
##  Pearson's product-moment correlation
##
## data:  joindf$meanvalence and joindf$sent
## t = 5.0146, df = 18, p-value = 8.996e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4846177 0.9014163
## sample estimates:
##       cor
## 0.7634204

A correlation of around 0.76 is not that strong given that these two measurements should capture very similar things. There is a lot of ongoing research comparing sentiment analysis methods like these; the most important thing is to choose one that can be validated for your application, as you can learn in the validating sentiment analysis exercise.
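If you mainly care about whether the two lexica order the characters in the same way, rather than about the exact scores, a rank correlation is a useful complementary check. This is a small sketch using the method argument of cor.test(), which was not part of the comparison above:

# Spearman correlation compares rankings instead of raw values
cor.test(joindf$meanvalence, joindf$sent, method="spearman")

A high rank correlation together with a moderate Pearson correlation would suggest that the two lexica rank characters similarly but on different scales.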