In this tutorial we will learn how to retrieve different kinds of network data from Twitter and how to process them in R.
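
All examples below use the rtweet package, which needs an authorized API token the first time you call it. A minimal sketch of creating one with create_token() is shown here; the app name and key strings are placeholders for the credentials of your own Twitter developer app.

library(rtweet)
# placeholders: replace with the credentials of your own Twitter developer app
token <- create_token(
  app             = "my_app_name",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)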

Reply networks

First, we retrieve the basic profile information of the users of a Twitter list. Here we use the list of professor and department accounts of ETH Zurich (https://twitter.com/i/lists/984765033673646080). The function lists_members() retrieves the user ids of the list members along with a profile object for each user:

library(rtweet)
library(dplyr)
# pass the list id as a string so the large id is not distorted by double precision
users <- lists_members(list_id = "984765033673646080")
head(users)
## # A tibble: 6 x 40
##   user_id name  screen_name location description url   protected followers_count
##   <chr>   <chr> <chr>       <chr>    <chr>       <chr> <lgl>               <int>
## 1 135624… ETH … ETHSPH      ""       ""          <NA>  FALSE                  11
## 2 132315… Seis… SWP_ETHZur… "Zurich… "Twitter a… http… FALSE                 213
## 3 130949… Juli… DannathJul… "Zurich… "This is t… http… FALSE                 167
## 4 128555… Quan… ETHQuantum… "Zurich… "The offic… http… FALSE                 199
## 5 127577… ETH … ETH_Mesosys "ETH Zu… "Joint lab… http… FALSE                  43
## 6 127026… Cent… pblcenter_… "Zurich" "PBL Cente… http… FALSE                  31
## # … with 32 more variables: friends_count <int>, listed_count <int>,
## #   created_at <dttm>, favourites_count <int>, utc_offset <lgl>,
## #   time_zone <lgl>, geo_enabled <lgl>, verified <lgl>, statuses_count <int>,
## #   lang <lgl>, contributors_enabled <lgl>, is_translator <lgl>,
## #   is_translation_enabled <lgl>, profile_background_color <chr>,
## #   profile_background_image_url <chr>,
## #   profile_background_image_url_https <chr>, profile_background_tile <lgl>,
## #   profile_image_url <chr>, profile_image_url_https <chr>,
## #   profile_link_color <chr>, profile_sidebar_border_color <chr>,
## #   profile_sidebar_fill_color <chr>, profile_text_color <chr>,
## #   profile_use_background_image <lgl>, has_extended_profile <lgl>,
## #   default_profile <lgl>, default_profile_image <lgl>, following <lgl>,
## #   follow_request_sent <lgl>, notifications <lgl>, translator_type <chr>,
## #   profile_banner_url <chr>
nrow(users)
## [1] 246

We will focus on a random sample of 100 of these accounts and retrieve their latest 100 tweets with the get_timelines() function:

userids <- sample(users$user_id, 100)
tweets <- get_timelines(userids, n=100)
head(tweets)
## # A tibble: 6 x 90
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 489555… 13112677… 2020-09-30 11:32:38 NEZimmerma… "Gre… Twitt…
## 2 489555… 13064968… 2020-09-17 07:34:48 NEZimmerma… "Nic… Twitt…
## 3 489555… 12368906… 2020-03-09 05:44:37 NEZimmerma… "#Ph… Twitt…
## 4 489555… 12368906… 2020-03-09 05:44:34 NEZimmerma… "#Ph… Twitt…
## 5 489555… 11971016… 2019-11-20 10:37:19 NEZimmerma… "Ver… Twitt…
## 6 489555… 11772254… 2019-09-26 14:16:30 NEZimmerma… "Sup… Twitt…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## #   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## #   is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## #   quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## #   urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## #   media_url <list>, media_t.co <list>, media_expanded_url <list>,
## #   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
## #   ext_media_expanded_url <list>, ext_media_type <chr>,
## #   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
## #   quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
## #   quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

Accessing timelines on Twitter has a much higher rate limit than most endpoints: 900 requests per 15-minute window. Even so, you can retrieve at most the 3,200 most recent tweets per user, and there is an additional cap of 100,000 requests per 24-hour period.
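
If you plan to pull many timelines, you can check how much of the quota is left before a large request. A small sketch using rtweet's rate_limit() helper; passing the rtweet function name as the query is one accepted form of the argument:

# check the remaining quota for the timeline endpoint
rl <- rate_limit(query = "get_timeline")
rl$remaining   # requests left in the current 15-minute window
rl$reset       # time until the window resets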

Tweets that are replies appear in the result with the reply_to_user_id column set to the id of the user being replied to; for all other tweets this column is NA:

head(tweets$reply_to_user_id, n=20)
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
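
Most tweets are not replies, so this column is mostly NA. A quick way to see what share of our sample consists of replies:

# fraction of tweets in the sample that are replies
mean(!is.na(tweets$reply_to_user_id))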

We can extract all reply links within our sample by selecting these two columns, dropping non-replies, and keeping only rows where both the author and the replied-to user are in our sample:

tweets %>% 
  select(user_id, reply_to_user_id) %>% 
  filter(!is.na(reply_to_user_id)) %>% 
  inner_join(data.frame(user_id=userids)) %>%
  inner_join(data.frame(reply_to_user_id=userids)) -> repliesdf
## Joining, by = "user_id"
## Joining, by = "reply_to_user_id"
dim(repliesdf)
## [1] 547   2

We then construct the edges, counting the number of reply instances as the edge weight:

names(repliesdf) <- c("from", "to")
repliesdf %>% 
  group_by(to, from) %>%
  summarize(weight = n()) -> edgesdf
## `summarise()` has grouped output by 'to'. You can override using the `.groups` argument.
head(edgesdf)
## # A tibble: 6 x 3
## # Groups:   to [5]
##   to                  from                weight
##   <chr>               <chr>                <int>
## 1 1025692751847989248 1025692751847989248     22
## 2 1031222219131834369 1031222219131834369     24
## 3 1039125456086282240 1039125456086282240      3
## 4 1047807847738888192 1047807847738888192      3
## 5 1064812518206554113 1064812518206554113      1
## 6 1064812518206554113 346179882                1
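
The message about grouped output is only informational. If you prefer an ungrouped result, you can pass the .groups argument to summarize(), for example:

repliesdf %>%
  group_by(to, from) %>%
  summarize(weight = n(), .groups = "drop") -> edgesdf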

We can then create the graph and plot it:

library(tidygraph)
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
graph <- tbl_graph(data.frame(id = userids), edgesdf, directed = FALSE, node_key = "id")
plot(graph)

If you see many self-loops, that is likely because accounts reply to their own tweets when building threads.
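
If you want to exclude them, here is a sketch of dropping self-loops before plotting, using tidygraph's edge_is_loop() helper:

# drop self-loops (replies to one's own account) before plotting
graph %>%
  activate(edges) %>%
  filter(!edge_is_loop()) %>%
  plot()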

Retweet networks

You can follow a similar process to build a retweet network. First, we extract entries using the retweet_user_id column:

tweets %>% 
  select(user_id, retweet_user_id) %>% 
  filter(!is.na(retweet_user_id)) %>% 
  inner_join(data.frame(user_id=userids)) %>%
  inner_join(data.frame(retweet_user_id=userids)) -> retweetsdf
## Joining, by = "user_id"
## Joining, by = "retweet_user_id"
dim(retweetsdf)
## [1] 230   2

We then construct the edges. The direction of a retweet network can be defined either way: from the retweeted account to the retweeter, as an information-flow network, or from the retweeter to the retweeted account, as a “giving credit” or endorsement network. We choose the latter so that the result resembles a follower network:

names(retweetsdf) <- c("from", "to")
retweetsdf %>% 
  group_by(to, from) %>%
  summarize(weight = n()) -> edgesdf
## `summarise()` has grouped output by 'to'. You can override using the `.groups` argument.
head(edgesdf)
## # A tibble: 6 x 3
## # Groups:   to [6]
##   to                  from                weight
##   <chr>               <chr>                <int>
## 1 1031222219131834369 295507571                1
## 2 1047808609210589184 934090570930286593       1
## 3 1064812518206554113 346179882               15
## 4 1073187048935342080 1073187048935342080      2
## 5 1074588131222020097 1074588131222020097      3
## 6 1088464875519655937 818560246654201860       1

We can then plot it as we did with the reply network:

graph <- tbl_graph(data.frame(id = userids), edgesdf, directed = FALSE, node_key = "id")
plot(graph)
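
Base plot() is enough for a quick look. For a more readable figure you could switch to the ggraph package (an optional sketch, assuming ggraph is installed):

library(ggraph)
ggraph(graph, layout = "fr") +   # force-directed (Fruchterman-Reingold) layout
  geom_edge_link() +             # draw edges as straight lines
  geom_node_point() +            # one point per account
  theme_void()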