You can find the markdown files and data for this exercise in the corresponding GitHub folder.
1. Load the timeline data of Twitter user accounts of Swiss politicians
2. Build social network of retweets
3. Calculate assortativity
4. Permutation tests
5. Community detection
First load the packages we will use in this exercise: dplyr, tidygraph, jsonlite, and ggraph.
library(dplyr)
library(tidygraph)
library(jsonlite)
library(ggraph)
Download the file SwissPoliticians.csv and read it as a csv in R. Take into account that separators are tabs. Change the screen names of accounts to lower case and add a column with a sequential id from 1 to the number of politicians.
poldf <- read.csv("SwissPoliticians.csv",sep="\t",header=TRUE, stringsAsFactors=FALSE)
poldf$screenName <- tolower(poldf$screenName)
poldf$id <- seq(1, nrow(poldf))
Read the politician tweets file taking into account that it is compressed. Print a random line and its content read as JSON. Check Exercise 2 (SIT on Twitter) if you need an example of how to do this.
lines <- readLines(gzfile("SwissPoliticians-tweets.json.gz"))
line <- lines[sample(length(lines), 1)]
line
fromJSON(line)
Iterate over all the lines you read from the file, interpreting each one as a JSON object with the data of a tweet. For each tweet that is a retweet, save the screen name of the user who tweeted it and the screen name of the user who made the tweet being retweeted. Save these two in a data frame with two columns.
# Vectors to store the retweeting user and the retweeted user of each retweet
userName <- NULL
RTuserName <- NULL
for (line in lines)
{
  tweet <- fromJSON(line)
  # A tweet is a retweet if it contains a retweeted_status object
  if (!is.null(tweet$retweeted_status$id_str))
  {
    userName[length(userName)+1] <- tweet$user$screen_name
    RTuserName[length(RTuserName)+1] <- tweet$retweeted_status$user$screen_name
  }
}
# Lowercase both columns so that they match the screen names in poldf
tweetsdf <- data.frame(userName = tolower(userName), RTuserName = tolower(RTuserName))
As a last step, filter the data frame to remove the cases in which a politician retweeted themselves. How many retweets did you have in the dataset before and after this filter?
nrow(tweetsdf)
tweetsdf %>% filter(userName != RTuserName) -> tweetsdf
nrow(tweetsdf)
Use the graph_assortativity function to calculate the assortativity with respect to party labels. How high is the value?
#Your code here
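One possible solution (a sketch, not the official one): since the retweet network itself is not built in the code above, we first construct it as a tidygraph object called graph, with the politicians in poldf as nodes and the retweet pairs in tweetsdf as edges. Keeping only retweets between politicians is an assumption made here so that every edge endpoint appears in the node table, and poledges is just a helper name used in this sketch.

# Keep only retweets whose two endpoints are both politicians in poldf
# (poledges is a helper name used only in this sketch)
tweetsdf %>%
  filter(userName %in% poldf$screenName & RTuserName %in% poldf$screenName) -> poledges
# Build the retweet network: nodes are politicians, edges are retweet pairs,
# matched to nodes through the screenName column; treated as undirected here
graph <- tbl_graph(nodes = poldf, edges = poledges, directed = FALSE, node_key = "screenName")
# Assortativity with respect to the party label of each node
graph %>%
  mutate(assort = graph_assortativity(party)) %>%
  pull(assort) %>%
  head(1)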
To see if the assortativity value fits your expectations, use ggraph to plot the network coloring each node according to the political party label of the politician. Does the pattern of colors fit the value of assortativity?
#Your code here
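One way to do this (a sketch; the "fr" force-directed layout is just one choice, mirroring the plotting code used later for communities):

# Force-directed layout with one point per politician, colored by party
graph %>%
  ggraph("fr") + geom_edge_link() + geom_node_point(aes(color=party)) + theme_graph()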
The above result looks assortative, but how can we test if it could have happened at random and not because of party identity? Here we are going to test it with a permutation test.
First, let’s run a permutation. Perform the same assortativity calculation as above but permuting the party labels of nodes. You can do this very efficiently by using the sample() function when you call the graph_assortativity() function.
graph %>%
mutate(assort=graph_assortativity(sample(party))) %>%
pull(assort) %>%
head(1)
Is the value much closer to zero? Repeat the calculation with 1000 permutations and plot the histogram of the resulting values. Add a line with the value of the assortativity without permutation. Is it far from or close to the permuted values?
N <- 1000
permassort <- NULL
for (i in seq(1,N))
{
#Your code here
}
graph %>%
mutate(assort=graph_assortativity(party)) %>%
pull(assort) %>%
head(1) -> res
hist(permassort, xlim=range(c(res, permassort)))
abline(v=res, col="red")
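A possible way to complete the loop above (a sketch that simply reuses the single-permutation code inside the loop):

N <- 1000
permassort <- NULL
for (i in seq(1, N))
{
  # Assortativity obtained with randomly permuted party labels
  permassort[i] <- graph %>%
    mutate(assort = graph_assortativity(sample(party))) %>%
    pull(assort) %>%
    head(1)
}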
To be sure, let’s calculate a p-value for the null hypothesis that the assortativity is zero and the alternative hypothesis that it is positive (what we expected):
#Your code here
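One possible calculation (a sketch): with a positive alternative hypothesis, the p-value is the fraction of permuted assortativity values that are at least as large as the observed one.

# One-sided p-value: proportion of permutations reaching the observed assortativity
mean(permassort >= res)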
After looking at the above results, do you think it is likely that the assortativity we found in the data was produced by chance?
Let’s test if Twitter communities match political affiliations. Remove nodes with degree zero in the network and run the Louvain community detection algorithm. Visualize the result coloring nodes by community labels (cast the label to a character so you have distinct colors).
graph %>%
activate(nodes) %>%
mutate(deg = centrality_degree()) %>%
filter(deg>0) -> graph
graph %>%
activate(nodes) %>%
mutate(community = group_louvain()) -> graph
graph %>%
ggraph("fr") + geom_edge_link() + geom_node_point(aes(color=as.character(community))) + theme_graph()
Run the graph_modularity function with the above community labels. Is it high enough to think that the network has a community structure?
graph %>%
mutate(modularity = graph_modularity(community)) %>%
pull(modularity) %>%
head(1)
Repeat but using the party labels (you might have to cast to factor) instead of the communities detected with Louvain. Is it higher or lower? How far is this modularity from the maximal one found with Louvain?
#Your code here
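A possible solution (a sketch; as the prompt suggests, the party label is cast to a factor so it can be used as a membership vector):

# Modularity of the partition defined by party labels
graph %>%
  mutate(modularity = graph_modularity(as.factor(party))) %>%
  pull(modularity) %>%
  head(1)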
Finally, to understand which parties are represented in each community, build a data frame for nodes with two columns: one with the party label and another one with the community label. Use the table() function to print a contingency table. Can you guess which party or parties compose each community?
graph %>%
activate(nodes) %>%
as_tibble() -> nodesdf
table(nodesdf$party, nodesdf$community)