In this exercise, we will test the hypothesis of division of impact from Social Impact Theory against data from Reddit. This hypothesis states that the extent of social impact exercised by a source on each of the members of a group decreases with the size of the group (the number of targets). On Reddit, this happens when someone writes a post on a subreddit that has subscribers (the group) and they vote for it (social impact). The hypothesis implies that aggregating the total impact grows with the number of targets, but does so sublinearly, i.e. slower than a straight line. Thus, we expect a sublinear trend between the score of a submission (a combination of up and down votes) and the size of the subreddit in number of subscribers.
We will use a very basic access to the Reddit API to make a search for subreddits by a term, for example “Data Science”. For each subreddit in the result with more than a minimum number of subscribers, we will request information on a few of its recent posts. We will record the number of subscribers of each subreddit and the score of each post to calculate the mean number of votes per submission in the subreddit, as a measure of social impact. We will fit a regression model of impact as a function of their number of subscribers, and test this way if there is a sublinear (but positive) relationship between the size of their audience and the extent of their impact. In the end, we will assess the uncertainty in our estimate of the exponent of this relationship by bootstrapping.
Connect to the API and search for subreddits
Retrieve post statistics
Calculate relevant variables
Visualize distributions and scatter plots
Fit and visualize the regression model
Bootstrapping
1.1 Make a test search
Load the httr package and try the GET call below. What do you see in the results and in the JSON content?
library(httr)
searchresults <- GET('https://www.reddit.com/subreddits/search.json?limit=10&q=data%20science', accept_json())
searchjson <- content(searchresults, type="application/json")
Read the list of results in the object you read from JSON and take a look at the first one.
reslist <- searchjson$data$children
reslist[[1]]
1.2 Make one search
Now make a search with your own term and asking for a limit of 100 results instead of 10. Save the list from the JSON as we did above to have the information on each subreddit.
searchresults <- GET( #Your code here
, accept_json())
searchjson <- content(searchresults, type="application/json")
reslist <- searchjson$data$children
1.3 Select subreddits
Iterate over the search results to create a data frame called “subredditsdf” that contains the names of each subreddit (display_name) and the number of subscribers (subscribers). Then filter that data frame to have only subreddits with at least 1000 subscribers.
library(dplyr)
subredditsdf <- NULL
for (i in seq(1, length(reslist)))
{
subredditsdf <- rbind(subredditsdf,
data.frame(
subreddit=reslist[[i]]$data$display_name,
subscribers=reslist[[i]]$data$subscribers))
}
# Your code here to filter subreddits with less than 1000 subscribers. Save the result in subredditsdf
If less than 50 subreddits remain after filtering, go back to the subreddit search (1.2) and use a more frequent term for the search and repeat this task.
2.1 Subreddit example
Use GET as before to retrieve the frontpage of r/funny. The results allow you to see some information on each post, including the title and the score.
srd <- "funny"
srresults <- GET(paste0('https://www.reddit.com/r/',srd,'.json'), accept_json())
srjson <- content(srresults, type="application/json")
srlist <- srjson$data$children
srlist[[1]]$data$title
srlist[[1]]$data$score
2.2 Gather subreddits data
For each subreddit in your data frame, make a request for its frontpage of posts. Save the score of each post in another dataframe called “postsdf” with two columns, one with the name of the subreddit of the post and another with the score of that post. Using the function Sys.sleep, add a random waiting period between requests, for example sampled uniformly between 10 and 15 seconds. You can tell if you are making too many requests by checking if the field “error” of the json result is not NULL. In that case you can code an extra waiting period and retry to get the posts of that subreddit and continue.
postsdf <- NULL
for (srd in subredditsdf$subreddit)
{
Sys.sleep(runif(n=1,min=10,max=15))
print(paste(srd, length(unique(postsdf$subreddit))))
postsresults <- GET(paste0('https://www.reddit.com/r/',srd,'.json'), accept_json())
postsjson <- content(postsresults, type="application/json")
while (! is.null(postsjson$error))
{
Sys.sleep(runif(n=1,min=10,max=15))
print(paste("repeating", srd))
postsresults <- GET(paste0('https://www.reddit.com/r/',srd,'.json'), accept_json())
postsjson <- content(postsresults, type="application/json")
}
postslist <- postsjson$data$children
for (i in seq(1, length(postslist)))
{
postsdf <- rbind(postsdf, data.frame(score=postslist[[i]]$data$score, subreddit=srd))
}
}
3.2 Aggregate and calculate post-level variables
Now on the result, we want to calculate the mean score of the posts of each subreddit. Here group_by and summarise from dplyr will be helpful.
postsdf %>% group_by(subreddit) %>% #Your code to aggregate here, save the resul int postsAggdf
3.3 Merge with subreddit data
Merge your two data frames: the one with the number of subscribers and the one with the mean score of posts. Here, the function inner_join of dplyr can be helpful.
srdf <- inner_join(postsAggdf, subredditsdf)
4.1 Distribution of the number of subscribers
Plot the histogram of the number of subscribers of subreddits in your dataset, and the histogram of the logarithm of the number of subscribers. Which one is more skewed?
hist(srdf$subscribers)
hist(log(srdf$subscribers))
4.2 Distribution of social impact
Repeat the above point but for the social impact of posts in each subreddit, also computing the logarithm. Which one is more skewed?
hist(srdf$mnScore)
hist(log(srdf$mnScore))
4.3 Number of subscribers vs social impact
Make a scatter plot with the logarithm of the number of followers of users on the horizontal axis and the logarithm of social impact on the vertical axis. Do you guess that there is a relationship?
# Your code here
5.1 Fit a linear model
Make two new columns on the users data frame, one called SI with the logarithm of the mean score, and another called FC with the logarithm of the number of subscribers. Use the lm function to fit a model with the SI as dependent variable and FC as independent variable.
# Your code here
Print the values of the coefficient estimates of the model. Do these values support or contradict Social Impact Theory?
# Your code here
5.2 Plot fit result
Plot the same scatter plot as in 3.3. Then use the abline function to draw a line of top with the intercept being the first coefficient of the model, and the slope as the second coefficient of the model. Add a second line with slope 1 and an intercept that allows you to see it in the range of the plot. Which slope do you think is closer to the slope you see in the scatter plot?
# Your code here
5.3 Calculate quality of the fit
Calculate the residuals of the model and save them in a vector. Then calculate the variance of the residuals and the variance of the social impact variable. Is the variance of the residuals lower than the variance of the dependent variable? By how much in proportion?
# Your code here
5.4 Distribution of residuals
Plot the histogram of residuals. Do they look normally distributted?
# Your code here
6.1 One sample
Make a new fit with a new dataset of the same size of the original but sampled with replacement. What is the value of the coefficients now?
# Adapt this code to work with your data frame name and column names
ids <- sample(nrow(srdf), replace=T)
bootmodel <- lm(srdf$SI[ids]~srdf$FC[ids])
bootmodel
6.2 Many boostrap samples
Repeat the bootstrap sample fit of the previous point 10000 times and save the values of the second coefficient in a vector.
# Adapt this code to work with your data frame name and column names
bootVs <- rep(0, 10000)
for (i in seq(1,10000))
{
ids <- sample(x = seq(1,nrow(srdf)), size = nrow(srdf), replace=T)
bootmodel <- lm(srdf$SI[ids]~srdf$FC[ids])
bootVs[i] <- bootmodel$coefficients[2]
}
6.3 Bootstrap histogram
Plot a histogram of the values resulting from the permutations and a vertical line on the value of the second coefficient of the original data. Use the xlim parameter of hist to make sure that both the histogram and the line can be plotted. How far is the line from the center of the histogram?
hist(bootVs, xlim=range(c(bootVs, model$coefficients[2])))
abline(v=model$coefficients[2], lty=2, col="green", lwd=2)
6.4 Extra: a version of above with boot()
library(boot)
# Adapt this code to work with your data frame name and column names
bootSlope <- function(df, ids)
{
bootmodel <- lm(df$SI[ids]~df$FC[ids])
return(bootmodel$coefficients[2])
}
bootRes <- boot(data = srdf, statistic = bootSlope, R = 10000)
hist(bootRes$t, xlim=range(c(bootRes$t, model$coefficients[2])))
abline(v=model$coefficients[2], lty=2, col="green", lwd=2)
Do you find any relationship between social impact and the number of subscribers?
How sure are you that it is larger than zero? How sure are you that it is lower than 1?
Is the value of the exponent of the model within the ranges assumed by Social Impact Theory?
Under that relationship, if a subreddit has 1000 subscribers, how many more subscribers does it need to double the social impact of posts in it?
What is the problem with the distribution of residuals of the model? Can you link that to the scatter plot with the model fit?