Sentiment Analysis: Computerized quantification of subjective states from text
Sentiment analysis is a subfield of Natural Language Processing. It can be combined with other tools like Named-Entity Recognition or Topic Modelling to contextualize the sentiment, for example finding its origin or targets. Here we focus on how to quantify sentiment from text, especially in social media and other kinds of digital traces.
There has been a scientific boom in sentiment analysis with several workshops, journal issues, and books devoted to the topic. Every year there are hundreds of research papers on the topic. You can see this rise in the Google Trends volume for the term “sentiment analysis”:
While peak interest seems to have been reached in 2019, there is still a lot of interest and open research questions in sentiment analysis.
Unsupervised sentiment analysis:
Supervised sentiment analysis:
Both approaches can be combined in what are called semi-supervised or ensemble methods. Some of these approaches mix supervised and unsupervised models in one classifier.
Evaluation and generalizability are key criteria when choosing a sentiment analysis method. You can learn more about them at the end of the Supervised Sentiment analysis topic and compare supervised and unsupervised methods yourself in the Evaluating sentiment analysis methods exercise.
In this topic we are going to cover various approaches to unsupervised sentiment analysis with examples of methods and software you can use.
The pioneering work of Philip Stone in 1966 proposed processing text with a computer to detect the use of words from various categories. This set the basis for dictionary methods in unsupervised sentiment analysis, which count how many times the words of a list appear in a text. The original version of the General Inquirer contained many word classes, including parts of speech and topics, as well as terms for emotions and evaluative language.
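The counting idea behind dictionary methods can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two word lists are hypothetical toy lists (not the General Inquirer dictionaries), the function name `count_dictionary_words` is made up for this example, and the tokenizer is a simple regular expression.

```python
import re

# Hypothetical toy word lists for illustration only;
# these are NOT the General Inquirer lists.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "terrible"}


def count_dictionary_words(text):
    """Count how many tokens of `text` appear in each toy word list."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(1 for token in tokens if token in POSITIVE)
    negative = sum(1 for token in tokens if token in NEGATIVE)
    return {"positive": positive, "negative": negative, "tokens": len(tokens)}


result = count_dictionary_words("I love this great phone, but the battery is bad.")
print(result)
```

The raw counts are usually divided by the number of tokens so that long and short texts can be compared.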
The original dictionaries of the General Inquirer were merged with other later dictionaries and an updated version was released in the 1990s. You can access the lists of positive words and of negative words of this version, which served as input for later methods like SentiStrength.
The SentimentAnalysis R package contains the General Inquirer (GI) dictionary and methods to match words in text.
LIWC (pronounced “Luke”) was developed as click-and-run software by James Pennebaker in 2001. Inspired by the General Inquirer, it contains a set of word lists that are matched against words in the text to compute frequencies for each list. The word lists of LIWC were designed both to cover linguistic classes and to capture psychological processes such as cognitive processes, social processes, and emotions. Word lists for LIWC are produced by groups of experts who compare their individual word lists and expand them with synonyms. There have been three versions of LIWC in English (2001, 2007 and 2015), and dictionaries have been generated with the same method for several languages including German, Spanish, French, Arabic, and Chinese.
Here you can see an example of how LIWC works on a text:
LIWC first tokenizes the text, i.e. it identifies words by looking for separators like whitespace and punctuation. Then LIWC iterates over each token (word) and checks if it matches any word list in the dictionary. These matches can be “hard” matches for the same exact character string, or “soft” matches with Kleene stems, which are prefixes of a word. Stems are the entries in the dictionary that end with a star symbol (“*”). You see this in the example for the entry “worr*” that matches “worry” and for “pizza*” that matches “pizza”.
In the example above you can see that words can belong to several word lists: for example, the entry for “worr*” is in the “affect” list, in the “negemo” list, and in the “anxiety” list. After running these matches, LIWC reports, for each word list, the percentage of words in the whole text that match it. In the example above, 12.5% of the words match the “negemo” list and 0% match the “posemo” list.
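The procedure described above (tokenize, check hard and stem matches, report percentages per word list) can be sketched roughly as follows. This is not LIWC's actual implementation: the dictionary is a hypothetical toy with three entries, the “ingest” list name for “pizza*” and the “happy” entry are assumptions added for illustration, and the sentence is an invented eight-word text rather than the one in the example.

```python
import re

# Toy dictionary mapping each entry to the word lists it belongs to.
# Hypothetical: only "worr*" and its three lists come from the text above;
# the "ingest" list name and the "happy" entry are assumptions.
TOY_DICTIONARY = {
    "worr*": ["affect", "negemo", "anxiety"],
    "pizza*": ["ingest"],
    "happy": ["affect", "posemo"],
}


def tokenize(text):
    """Split text into lowercase tokens on whitespace and punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def entry_matches(entry, token):
    """Stem entries ending in '*' match by prefix; others need an exact match."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry


def liwc_style_percentages(text, dictionary):
    """Return, per word list, the percentage of tokens matching that list."""
    tokens = tokenize(text)
    counts = {name: 0 for lists in dictionary.values() for name in lists}
    for token in tokens:
        matched = set()  # a token counts at most once per word list
        for entry, lists in dictionary.items():
            if entry_matches(entry, token):
                matched.update(lists)
        for name in matched:
            counts[name] += 1
    total = len(tokens)
    return {name: (100.0 * n / total if total else 0.0) for name, n in counts.items()}


scores = liwc_style_percentages("I worry that the pizza will arrive cold", TOY_DICTIONARY)
print(scores)
```

With this invented sentence, one of eight tokens matches “worr*”, so the “negemo” score is 12.5 and the “posemo” score is 0.0. Because a token can match several lists, the percentages do not need to add up to 100.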
The 2015 version of LIWC includes netspeak terms such as “WTF” or “LOL” and emoticons like “:)”. LIWC is a very popular tool because it is easy to use; for example, it offers a way to visualize which words are matched. It is very important to look at these matches to understand LIWC emotion word frequencies, as you can learn in the Social Data Science story about 9/11 pagers.