You can find the markdown files and data for this exercise in the corresponding Github folder.

Let’s apply sentiment analysis to a real case of social media data. In this exercise, we are going to collect Twitter data and apply VADER to it, with the purpose to test if the sentiment of tweets is correlated with the number of retweets they receive and with the number of followers of the person tweeting them.

# 1. Retweets of an influential user

Choose an English-speaking celebrity or a media Twitter user account with more than 5000 tweets. For example, you can use “nytimes” or “barackobama”. Load the rtweet package and use the get_timeline function to get the last 3200 tweets of that account.

#Your code here

Now load vader and run it over the text of the tweets that you received. After you ran it, add a new column to the data frame with the sentiment class of each tweet, using as thresholds -0.05 and 0.05.

#Your code here

Before analyzing the data, we should clean it. Use dplyr to delete from the dataset the tweets that are retweets and the ones that have no retweets, becuase they are too recent.

#Your code here

The plan is to fit a regression model of the logarithm of the number of retweets of a tweet as a function of the sentiment class of the tweet. Before fitting we are going to “relevel” the class so that the reference class is “Neutral”. This will help to make the results of the fit more interpretable:

tweetsfit$class <- relevel(as.factor(tweetsfit$class), ref="Neutral")

Now you can fit the model with lm and print the summary of the model to have an idea of what is retweeted more:

#Your code here

The summary gives you the idea that some tweets might be retweeted more if their sentiment class was different than neutral. The summary also gives us an idea of the quality of the fit ($$R^2$$), which is likely rather low since there are many things involved in the number of retweets of a tweet beyond its sentiment.

Want to have an estimate of the uncertainty of the coefficient for the negative class. To do so, you can apply bootstrapping as we did in previous exercises. You can do it with a loop or by defining a function for the library boot. When you have the values of coefficients in each bootstrap sample, plot a histogram. Does it look far from zero?

#Your code here

Now repeat the above analysis for the count of favorites as a dependent variable. Fit only for tweets with at least one favorite and that are not retweets. Do you see a difference?

#Your code here

# 2. Are people with more followers more emotional?

We are going to address the question by generating a random tweets sample in English. Look for the coordinates of a major city in the US or the UK and use them to define the “geocode” parameter ot a search in the Twitter API. Set a radius around the point of few miles and search up to 5000 tweets in English and do not keep retweets.

#Your code here

Now, as we did in the previous case, run VADER over the tweets and classify them into sentiment classes of positive, negative, and neutral tweets.

#Your code here

Plot a histogram of the compound value of the tweets. Do tweets tend to be more positive than negative? What is the mean of compound values? What are the counts of tweets per class?

#Your code here

Remove tweets from users without followers as they are likely bots/spam or too new. Calculate the mean of the logarithm of the number of followers of tweets of each sentiment class. Do you notice differences? Are they large?

#Your code here

Now following the example of the models of retweets above, relevel the sentiment classes and fit a linear regression model of the logarithm of the number of followers as a function of class, ignoring tweets that are retweets.

#Your code here

To assess the uncertainty in the value of coefficients, apply the same bootstrapping approach as for the other models for the coefficient of negative sentiment here. Does it look far from zero?

#Your code here

# 3. To do more

• Try with a different sentiment analysis method (e.g. syuzhet). Do results change? Which one is more accurate for Twitter data?
• Use the Twitter API to search for data at least two days old and at most a week old to measure the number of retweets of tweets. Do negative tweets get retweeted more often? What is the role of the number of followers in this case?
• Search or stream data for a bit longer to get many recent tweets. Do you get stronger differences in the differences of the number of followers by sentiment?