In this exercise we will test a hypothesis from Social Impact Theory against data from Twitter users. We will collect the data through the Twitter API, so you need a Twitter user account to do this exercise.

You can find the markdown files and data for this exercise in the corresponding GitHub folder.

You will use a list of Twitter users of your choice or a random sample of users. First, we will measure their number of followers. Then we will retrieve the last tweets of each user and measure the average number of retweets they receive, as a measure of their social impact. We will fit a regression model of their impact as a function of their number of followers, and test whether there is a sublinear (but positive) relationship between the size of their audience and the extent of their impact.
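
One common way to write down a sublinear but positive relationship is a power law. The following is a sketch of that formulation, where the symbols I (impact), F (number of followers), c and α are introduced here only for illustration:

```latex
I = c\,F^{\alpha}, \qquad 0 < \alpha < 1
\quad\Longrightarrow\quad
\log I = \log c + \alpha \log F
```

Under this assumption the relationship is a straight line on log-log axes, and its slope α is what the regression in task 4 estimates.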

Tasks:

  1. Connecting to Twitter with the rtweet package

  2. Construct the timelines of Twitter users

  3. Visualize distributions and scatter plots

  4. Fit and visualize a regression model

  5. Bootstrapping

1. Connecting to Twitter with the rtweet package

1.1 Getting ready with rtweet

Before starting, install the rtweet package. Also remember to set your working directory under “Session” -> “Set Working Directory” -> “To Source File Location”.

install.packages("rtweet")

Now we will load the rtweet package:

library(rtweet)

You can connect to the Twitter API in two ways. The first is simply to continue with these exercises: when you run the next line of code in interactive mode, your browser will open and ask you to give permissions to the rstats2twitter app. Once you grant them, you will be able to access the Twitter API. The second way is to get a developer account and create an app. This is more complicated, but better if you plan a longer project with this API. Check the rtweet tutorial for more details.

1.2 Testing the connection with some examples

After that you will be able to access the Twitter API from R. For example, we can get basic information on the Twitter account of the New York Times:

library(dplyr)
glimpse(lookup_users(users = "nytimes"))
## Rows: 1
## Columns: 90
## $ user_id                 <chr> "807095"
## $ status_id               <chr> "1360997268585918466"
## $ created_at              <dttm> 2021-02-14 17:00:09
## $ screen_name             <chr> "nytimes"
## $ text                    <chr> "Even as a boy, Russian opposition leader Ale…
## $ source                  <chr> "SocialFlow"
## $ display_text_width      <int> NA
## $ reply_to_status_id      <lgl> NA
## $ reply_to_user_id        <lgl> NA
## $ reply_to_screen_name    <lgl> NA
## $ is_quote                <lgl> FALSE
## $ is_retweet              <lgl> FALSE
## $ favorite_count          <int> 6
## $ retweet_count           <int> 0
## $ quote_count             <int> NA
## $ reply_count             <int> NA
## $ hashtags                <list> [NA]
## $ symbols                 <list> [NA]
## $ urls_url                <list> ["twitter.com/i/web/status/1…"]
## $ urls_t.co               <list> ["https://t.co/Vsy1IKYsCN"]
## $ urls_expanded_url       <list> ["https://twitter.com/i/web/status/136099726…
## $ media_url               <list> [NA]
## $ media_t.co              <list> [NA]
## $ media_expanded_url      <list> [NA]
## $ media_type              <list> [NA]
## $ ext_media_url           <list> [NA]
## $ ext_media_t.co          <list> [NA]
## $ ext_media_expanded_url  <list> [NA]
## $ ext_media_type          <chr> NA
## $ mentions_user_id        <list> [NA]
## $ mentions_screen_name    <list> [NA]
## $ lang                    <chr> "en"
## $ quoted_status_id        <chr> NA
## $ quoted_text             <chr> NA
## $ quoted_created_at       <dttm> NA
## $ quoted_source           <chr> NA
## $ quoted_favorite_count   <int> NA
## $ quoted_retweet_count    <int> NA
## $ quoted_user_id          <chr> NA
## $ quoted_screen_name      <chr> NA
## $ quoted_name             <chr> NA
## $ quoted_followers_count  <int> NA
## $ quoted_friends_count    <int> NA
## $ quoted_statuses_count   <int> NA
## $ quoted_location         <chr> NA
## $ quoted_description      <chr> NA
## $ quoted_verified         <lgl> NA
## $ retweet_status_id       <chr> NA
## $ retweet_text            <chr> NA
## $ retweet_created_at      <dttm> NA
## $ retweet_source          <chr> NA
## $ retweet_favorite_count  <int> NA
## $ retweet_retweet_count   <int> NA
## $ retweet_user_id         <chr> NA
## $ retweet_screen_name     <chr> NA
## $ retweet_name            <chr> NA
## $ retweet_followers_count <int> NA
## $ retweet_friends_count   <int> NA
## $ retweet_statuses_count  <int> NA
## $ retweet_location        <chr> NA
## $ retweet_description     <chr> NA
## $ retweet_verified        <lgl> NA
## $ place_url               <chr> NA
## $ place_name              <chr> NA
## $ place_full_name         <chr> NA
## $ place_type              <chr> NA
## $ country                 <chr> NA
## $ country_code            <chr> NA
## $ geo_coords              <list> [<NA, NA>]
## $ coords_coords           <list> [<NA, NA>]
## $ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>]
## $ status_url              <chr> "https://twitter.com/NA/status/13609972685859…
## $ name                    <chr> "The New York Times"
## $ location                <chr> "New York City"
## $ description             <chr> "News tips? Share them here: https://t.co/ghL…
## $ url                     <chr> "http://t.co/ahvuWqicF9"
## $ protected               <lgl> FALSE
## $ followers_count         <int> 49342177
## $ friends_count           <int> 901
## $ listed_count            <int> 207513
## $ statuses_count          <int> 422456
## $ favourites_count        <int> 18377
## $ account_created_at      <dttm> 2007-03-02 20:41:42
## $ verified                <lgl> TRUE
## $ profile_url             <chr> "http://t.co/ahvuWqicF9"
## $ profile_expanded_url    <chr> "http://www.nytimes.com/"
## $ account_lang            <lgl> NA
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/807095…
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme14/b…
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/10982445…

The glimpse() function of dplyr makes the returned data easier to read, because it has too many columns to be readable when printed as a regular table.

You can also get the last ten tweets posted by the New York Times account:

glimpse(get_timeline(user="nytimes", n=10))
## Rows: 10
## Columns: 90
## $ user_id                 <chr> "807095", "807095", "807095", "807095", "8070…
## $ status_id               <chr> "1360997268585918466", "1360992210070753280",…
## $ created_at              <dttm> 2021-02-14 17:00:09, 2021-02-14 16:40:03, 20…
## $ screen_name             <chr> "nytimes", "nytimes", "nytimes", "nytimes", "…
## $ text                    <chr> "Even as a boy, Russian opposition leader Ale…
## $ source                  <chr> "SocialFlow", "SocialFlow", "SocialFlow", "So…
## $ display_text_width      <dbl> 193, 240, 140, 211, 186, 172, 202, 134, 148, …
## $ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, "1360973871688671233"…
## $ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, "807095", "807095", "…
## $ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, "nytimes", "nytimes",…
## $ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ is_retweet              <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS…
## $ favorite_count          <int> 6, 105, 0, 187, 69, 577, 65, 84, 43, 41
## $ retweet_count           <int> 0, 15, 100, 38, 20, 67, 18, 18, 12, 8
## $ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ hashtags                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA]
## $ symbols                 <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA]
## $ urls_url                <list> ["nyti.ms/2OCwx2e", "nyti.ms/3pjre48", NA, "…
## $ urls_t.co               <list> ["https://t.co/wBqHXCBj3z", "https://t.co/il…
## $ urls_expanded_url       <list> ["https://nyti.ms/2OCwx2e", "https://nyti.ms…
## $ media_url               <list> [NA, NA, NA, NA, NA, NA, NA, "http://pbs.twi…
## $ media_t.co              <list> [NA, NA, NA, NA, NA, NA, NA, "https://t.co/v…
## $ media_expanded_url      <list> [NA, NA, NA, NA, NA, NA, NA, "https://twitte…
## $ media_type              <list> [NA, NA, NA, NA, NA, NA, NA, "photo", "photo…
## $ ext_media_url           <list> [NA, NA, NA, NA, NA, NA, NA, "http://pbs.twi…
## $ ext_media_t.co          <list> [NA, NA, NA, NA, NA, NA, NA, "https://t.co/v…
## $ ext_media_expanded_url  <list> [NA, NA, NA, NA, NA, NA, NA, "https://twitte…
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ mentions_user_id        <list> [NA, NA, "119478339", NA, NA, NA, NA, NA, NA…
## $ mentions_screen_name    <list> [NA, NA, "elizashapiro", NA, NA, NA, NA, NA,…
## $ lang                    <chr> "en", "en", "en", "en", "en", "en", "en", "en…
## $ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ retweet_status_id       <chr> NA, NA, "1360944318819823623", NA, NA, NA, NA…
## $ retweet_text            <chr> NA, NA, "If you told me in August that NYC wo…
## $ retweet_created_at      <dttm> NA, NA, 2021-02-14 13:29:45, NA, NA, NA, NA,…
## $ retweet_source          <chr> NA, NA, "Twitter Web App", NA, NA, NA, NA, NA…
## $ retweet_favorite_count  <int> NA, NA, 378, NA, NA, NA, NA, NA, NA, NA
## $ retweet_retweet_count   <int> NA, NA, 100, NA, NA, NA, NA, NA, NA, NA
## $ retweet_user_id         <chr> NA, NA, "119478339", NA, NA, NA, NA, NA, NA, …
## $ retweet_screen_name     <chr> NA, NA, "elizashapiro", NA, NA, NA, NA, NA, N…
## $ retweet_name            <chr> NA, NA, "Eliza Shapiro", NA, NA, NA, NA, NA, …
## $ retweet_followers_count <int> NA, NA, 37688, NA, NA, NA, NA, NA, NA, NA
## $ retweet_friends_count   <int> NA, NA, 2352, NA, NA, NA, NA, NA, NA, NA
## $ retweet_statuses_count  <int> NA, NA, 6779, NA, NA, NA, NA, NA, NA, NA
## $ retweet_location        <chr> NA, NA, "", NA, NA, NA, NA, NA, NA, NA
## $ retweet_description     <chr> NA, NA, "I write about New York schools for T…
## $ retweet_verified        <lgl> NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA
## $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ geo_coords              <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA…
## $ coords_coords           <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA…
## $ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, …
## $ status_url              <chr> "https://twitter.com/nytimes/status/136099726…
## $ name                    <chr> "The New York Times", "The New York Times", "…
## $ location                <chr> "New York City", "New York City", "New York C…
## $ description             <chr> "News tips? Share them here: https://t.co/ghL…
## $ url                     <chr> "http://t.co/ahvuWqicF9", "http://t.co/ahvuWq…
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ followers_count         <int> 49342177, 49342177, 49342177, 49342177, 49342…
## $ friends_count           <int> 901, 901, 901, 901, 901, 901, 901, 901, 901, …
## $ listed_count            <int> 207513, 207513, 207513, 207513, 207513, 20751…
## $ statuses_count          <int> 422456, 422456, 422456, 422456, 422456, 42245…
## $ favourites_count        <int> 18377, 18377, 18377, 18377, 18377, 18377, 183…
## $ account_created_at      <dttm> 2007-03-02 20:41:42, 2007-03-02 20:41:42, 20…
## $ verified                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ profile_url             <chr> "http://t.co/ahvuWqicF9", "http://t.co/ahvuWq…
## $ profile_expanded_url    <chr> "http://www.nytimes.com/", "http://www.nytime…
## $ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/807095…
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme14/b…
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/10982445…

2. Construct the timelines of Twitter users

2.1 Getting the list of users

Choose a Twitter list with a few hundred members and get its id. You can find some example lists at https://docs.google.com/spreadsheets/d/1tcNy1q_eQH3HXGt-0hkmSNEGbcOUiC5si3kZ6-F0pB8/. In the chunk below, get the user information of the list members, including user names and counts of followers and tweets.

#Your code here

From those users, we are interested in those who have written at least 100 tweets and have at least 100 followers. From the remaining set, sample 100 at random. Give it a try with dplyr:

#Your Code Here
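
As a rough illustration of the filtering and sampling step, here is a minimal sketch on simulated data rather than on a real Twitter list. The data frame `users` and its values are made up; only the column names `followers_count` and `statuses_count` follow the rtweet output shown earlier.

```r
library(dplyr)

# Simulated stand-in for the user table (real data would come from rtweet)
set.seed(1)
users <- data.frame(
  screen_name     = paste0("user", 1:500),
  followers_count = round(rlnorm(500, meanlog = 6, sdlog = 2)),
  statuses_count  = round(rlnorm(500, meanlog = 6, sdlog = 2))
)

# Keep active accounts with an audience, then draw 100 of them at random
users_sample <- users %>%
  filter(statuses_count >= 100, followers_count >= 100) %>%
  slice_sample(n = 100)

nrow(users_sample)
```

Note that slice_sample() returns fewer rows than requested when fewer than 100 users pass the filter, so it is worth checking the number of rows afterwards.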

2.2 Downloading timelines

As a test, we are going to get the last 200 tweets of the first user in our list. By setting include_rts to FALSE we will only get the original tweets among those 200:

#Your code here

Your turn: look up the documentation of rtweet and find a function that allows you to get the timelines of a list of users. Then do the same as above for all the users in your dataset, retrieving their last 200 tweets and keeping only the original tweets. Make sure you save the data correctly, or you might have to wait a long time to make this request again. If you go above the rate limit, you can use the file “usersdf.RData” to continue the exercise. Check the appendix to see how to apply for a developer account and authenticate requests; with one you will have much higher limits and shorter waiting times.

#Your code here
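
The download itself depends on your API access, but the saving step can be sketched independently. The example below assumes a toy data frame called `timelines` and a hypothetical file name in a temporary directory. It only shows the save() and load() round trip in base R.

```r
# Toy stand-in for the downloaded timelines (hypothetical values)
timelines <- data.frame(
  screen_name   = c("user1", "user1", "user2"),
  retweet_count = c(3L, 0L, 12L)
)

# Write the object to disk right after downloading it
path <- file.path(tempdir(), "timelines.RData")  # hypothetical file name
save(timelines, file = path)

# In a later session, load() restores the object under its original name
rm(timelines)
load(path)
nrow(timelines)
```

In your own project you would save to a file in your working directory instead of tempdir(), so that the data survives after the R session ends.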

2.3 Aggregating and arranging data

Now, on the result, we want to calculate the mean number of retweets per tweet for each user. Here group_by and summarise from dplyr will be helpful. Save the result in a data frame called RTdf and name the column with the mean number of retweets mnRT.

#Your code here
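
A minimal sketch of this aggregation on a toy table, assuming one row per tweet and the rtweet column names `screen_name` and `retweet_count`. The values are made up for illustration.

```r
library(dplyr)

# Toy timeline table: one row per tweet (hypothetical values)
timelines <- data.frame(
  screen_name   = c("user1", "user1", "user1", "user2", "user2"),
  retweet_count = c(2, 4, 0, 10, 20)
)

# One row per user with the mean number of retweets per tweet
RTdf <- timelines %>%
  group_by(screen_name) %>%
  summarise(mnRT = mean(retweet_count))

RTdf
```

Grouping by `user_id` instead of `screen_name` would work the same way and is more robust if a user changes their screen name.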

Now we select the columns we want from the user data frame and merge them with our timeline analysis. This gives us the data frame we need to test Social Impact Theory: the number of followers (audience size) and the mean number of retweets (social impact).

#Your code here
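
One possible way to do the merge, sketched with base R on toy data. The data frame names `users` and `impactdf` and all values are hypothetical; the key column `screen_name` is assumed to be present in both tables.

```r
# Toy user table and toy aggregation result (hypothetical values)
users <- data.frame(
  screen_name     = c("user1", "user2", "user3"),
  followers_count = c(150, 3200, 800),
  statuses_count  = c(500, 1200, 300)
)
RTdf <- data.frame(screen_name = c("user1", "user2"), mnRT = c(2, 15))

# Keep only the needed user columns, then match rows by screen_name
impactdf <- merge(users[, c("screen_name", "followers_count")], RTdf,
                  by = "screen_name")
impactdf
```

By default merge() keeps only users that appear in both tables, so users without any retrieved tweets (user3 here) drop out of the result.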

3. Visualize distributions and scatter plots

3.1 Distribution of the number of followers

Plot the histogram of the number of followers of users in your dataset, and the histogram of the logarithm of the number of followers. Which one is more skewed?

#load("usersdf.RData")  #in case you could not get data above, this file has 100 random US congress members
#Your code here
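
To see what to look for, here is a sketch on simulated follower counts, not on real Twitter data. The log-normal distribution is an assumption chosen because it produces heavy-tailed counts, and the `skewness` helper is defined here for illustration.

```r
# Simulated follower counts (hypothetical, all larger than zero)
set.seed(42)
followers <- round(rlnorm(1000, meanlog = 8, sdlog = 1.5)) + 1

# Histograms of the raw and the log-transformed counts side by side
par(mfrow = c(1, 2))
hist(followers, main = "Followers", xlab = "Number of followers")
hist(log(followers), main = "Log followers", xlab = "log(number of followers)")

# Sample skewness: the third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(followers)
skewness(log(followers))
```

A skewness close to zero indicates a roughly symmetric histogram, while a large positive value indicates a long right tail.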

3.2 Distribution of social impact

Repeat the above point but for the social impact of your users, also computing the logarithm. Which one is more skewed?

#Your code here

3.3 Number of followers vs social impact

Make a scatter plot with the logarithm of the number of followers of users on the horizontal axis and the logarithm of social impact on the vertical axis. Does it look like there is a relationship?

#Your code here

4. Fit and visualize a regression model

4.1 Fit a linear model

Make two new columns in the users data frame, one called SI with the logarithm of the mean number of retweets, and another called FC with the logarithm of the number of followers. Use the lm function to fit a model with SI as the dependent variable and FC as the independent variable.

#Your code here
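
The fit can be sketched on simulated data where the true exponent is known. Everything below is an assumption for illustration: the data frame `simdf`, the exponent 0.6 and the noise level are made up, and only the column names SI, FC, mnRT and followers_count follow the text.

```r
# Simulated users with a known sublinear exponent (hypothetical values)
set.seed(123)
n <- 200
followers <- rlnorm(n, meanlog = 8, sdlog = 1.5)
mnRT <- 0.5 * followers^0.6 * exp(rnorm(n, sd = 0.5))
simdf <- data.frame(followers_count = followers, mnRT = mnRT)

# Log-transform both variables
simdf$SI <- log(simdf$mnRT)
simdf$FC <- log(simdf$followers_count)

# Linear model on the log-log scale: the FC coefficient estimates the exponent
model <- lm(SI ~ FC, data = simdf)
coef(model)
```

With real data, users whose mean number of retweets is zero give log(0) = -Inf, so they need to be handled before fitting, for example by removing them.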

Print the values of the coefficient estimates of the model. Do these values support or contradict Social Impact Theory?

#Your code here

4.2 Plot fit result

Plot the same scatter plot as in 3.3. Then use the abline function to draw a line on top, with the first coefficient of the model as the intercept and the second coefficient as the slope. How well does the line fit the points?

#Your code here
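
A sketch of the overlay on simulated log-log data. The vectors FC and SI are generated here with made-up parameters and do not come from Twitter.

```r
# Simulated log followers and log impact (hypothetical parameters)
set.seed(123)
FC <- rnorm(200, mean = 8, sd = 1.5)
SI <- -0.7 + 0.6 * FC + rnorm(200, sd = 0.5)
model <- lm(SI ~ FC)

# Scatter plot with the fitted line on top
plot(FC, SI, xlab = "log(followers)", ylab = "log(mean retweets)")
abline(a = coef(model)[1], b = coef(model)[2], col = "red", lwd = 2)
```

abline() also accepts the fitted model directly, as in abline(model), which draws the same line.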

4.3 Calculate quality of the fit

Calculate the residuals of the model and save them in a vector. Then calculate the variance of the residuals and the variance of the social impact variable. Is the variance of the residuals lower than the variance of the dependent variable? By how much in proportion?

#Your code here
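
The comparison of variances can be sketched as follows, again on simulated data with made-up parameters. For a linear model with an intercept, the proportion computed this way coincides with the R-squared reported by summary().

```r
# Simulated log followers and log impact (hypothetical parameters)
set.seed(123)
FC <- rnorm(200, mean = 8, sd = 1.5)
SI <- -0.7 + 0.6 * FC + rnorm(200, sd = 0.5)
model <- lm(SI ~ FC)

# Residuals and the two variances to compare
res <- residuals(model)
var(res)
var(SI)

# Proportion of the variance of SI that the model accounts for
explained <- 1 - var(res) / var(SI)
explained
summary(model)$r.squared
```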

4.4 Distribution of residuals

Plot the histogram of residuals. Do they look normally distributed?

#Your code here

5. Bootstrapping

5.1 One sample

Make a new fit with a new dataset of the same size as the original but sampled with replacement. What are the values of the coefficients now?

#Your code here
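
A single bootstrap sample can be sketched like this, on a simulated data frame `simdf` with made-up parameters. The idea is to draw row indices with replacement and refit the same model on the resampled rows.

```r
# Simulated data (hypothetical parameters)
set.seed(123)
simdf <- data.frame(FC = rnorm(100, mean = 8, sd = 1.5))
simdf$SI <- -0.7 + 0.6 * simdf$FC + rnorm(100, sd = 0.5)

# Draw as many row indices as there are rows, with replacement
idx <- sample(nrow(simdf), replace = TRUE)
bootdf <- simdf[idx, ]

# Coefficients on the original data and on the bootstrap sample
coef(lm(SI ~ FC, data = simdf))
coef(lm(SI ~ FC, data = bootdf))
```

Because some rows appear several times and others not at all, the coefficients differ slightly from the original fit.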

5.2 Many bootstrap samples

Repeat the bootstrap sample fit of the previous point 10000 times and save the values of the second coefficient in a vector.

# Your code here
# How can you do this with the boot() function?
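
A sketch of the repeated bootstrap with base R, on simulated data with made-up parameters. It uses 2000 repetitions instead of the 10000 asked for above, only to keep the example fast.

```r
# Simulated data (hypothetical parameters)
set.seed(123)
simdf <- data.frame(FC = rnorm(100, mean = 8, sd = 1.5))
simdf$SI <- -0.7 + 0.6 * simdf$FC + rnorm(100, sd = 0.5)

# Refit the model on many bootstrap samples and keep the slope each time
n_boot <- 2000
boot_slopes <- replicate(n_boot, {
  idx <- sample(nrow(simdf), replace = TRUE)
  coef(lm(SI ~ FC, data = simdf[idx, ]))[[2]]
})

summary(boot_slopes)
```

The boot() function of the boot package wraps the same loop: it takes the data, a statistic function of the form function(data, indices), and the number of replicates R.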

5.3 Bootstrap histogram

Plot a histogram of the values resulting from the bootstrap samples and a vertical line at the value of the second coefficient of the original data. Use the xlim parameter of hist to make sure that both the histogram and the line are visible. How far is the line from the center of the histogram?

# Your code here
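
A sketch of the plot on simulated data with made-up parameters. The percentile interval at the end is an addition that is not asked for above. It is one simple way to summarise how far the bootstrap slopes spread.

```r
# Simulated data and bootstrap slopes (hypothetical parameters)
set.seed(123)
simdf <- data.frame(FC = rnorm(100, mean = 8, sd = 1.5))
simdf$SI <- -0.7 + 0.6 * simdf$FC + rnorm(100, sd = 0.5)
orig_slope <- coef(lm(SI ~ FC, data = simdf))[[2]]
boot_slopes <- replicate(1000, {
  idx <- sample(nrow(simdf), replace = TRUE)
  coef(lm(SI ~ FC, data = simdf[idx, ]))[[2]]
})

# Make the x axis wide enough for both the histogram and the line
xrange <- range(c(boot_slopes, orig_slope))
hist(boot_slopes, xlim = xrange, main = "Bootstrap slopes", xlab = "Slope")
abline(v = orig_slope, col = "red", lwd = 2)

# 95% percentile interval of the bootstrap slopes
ci <- quantile(boot_slopes, c(0.025, 0.975))
ci
```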

Conclusions

  1. Do you find any relationship between social impact and the number of followers?

  2. How sure are you that it is larger than zero? How sure are you that it is lower than 1?

  3. Is the value of the relationship within the ranges predicted by Social Impact Theory?

  4. Under that relationship, if I have 1000 followers, how many more followers do I need to double my social impact?
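
For the last question, the arithmetic can be sketched under the assumption that impact grows as a power of the number of followers with exponent alpha. Doubling the impact then requires multiplying the followers by 2^(1/alpha). The function name and the value alpha = 0.5 below are hypothetical; use the slope you estimated instead.

```r
# Extra followers needed to double impact if impact is proportional to
# followers^alpha (assumed power-law form)
extra_followers_to_double <- function(followers, alpha) {
  followers * 2^(1 / alpha) - followers
}

# Hypothetical exponent: 1000 followers must become 4000, so 3000 more
extra_followers_to_double(1000, alpha = 0.5)
```

The smaller the exponent, the faster this number grows, which is what a sublinear relationship implies.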