Welcome to the Social Data Science dplyr tutorial! This will be a quick introduction into one of the most useful tools to handle large datasets by producing code that is easy to read.
You can find this script to run in Rstudio cloud: https://rstudio.cloud/project/892606
dplyr https://https://dplyr.tidyverse.org/ is a package for data manipulation, which is the process of re-sorting, rearranging, and moving data, without fundamentally changing it. It provides a way to make your data more readable and organized in an efficient way.
dplyr provides “verbs” that you can apply to the data frame that you are working on. Here we are going to see the following:
Some versions of RStudio bring dplyr installed by default, but run the following chunk to install it in any case:
install.packages("dplyr")
Once you have it install, load it with the library function.
library(dplyr)
## Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
## when loading 'dplyr'
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
For this tutorial we are going to use the data from the National Health Interview Survey of 2007.
NHIS2007 <- read.csv("NHIS2007.csv")
head(NHIS2007)
## HHX FMX FPX SEX BMI SLEEP educ height weight
## 1 16 1 2 1 33.36 8 16 74 260
## 2 20 1 1 1 26.54 7 14 70 185
## 3 69 1 2 2 32.13 7 9 61 170
## 4 87 1 1 1 26.62 8 14 68 175
## 5 88 1 1 2 27.13 8 13 66 168
## 6 99 1 1 2 99.99 98 12 98 998
select() lets you subset by columns. The first parameter of select() is the data frame, the second is a selection of columns. For example to get just the column BMI:
head(select(NHIS2007, BMI))
## BMI
## 1 33.36
## 2 26.54
## 3 32.13
## 4 26.62
## 5 27.13
## 6 99.99
But you can also select subsets of colums like the first three:
head(select(NHIS2007, 1:3))
## HHX FMX FPX
## 1 16 1 2
## 2 20 1 1
## 3 69 1 2
## 4 87 1 1
## 5 88 1 1
## 6 99 1 1
or a range:
head(select(NHIS2007, SEX:educ))
## SEX BMI SLEEP educ
## 1 1 33.36 8 16
## 2 1 26.54 7 14
## 3 2 32.13 7 9
## 4 1 26.62 8 14
## 5 2 27.13 8 13
## 6 2 99.99 98 12
Your turn: select the first and last columns.
# Your code here
filter() allows you to subset rows that fulfill a condition. The second parameter is a logical statement to take only rows for which it is TRUE.
For example, to get the rows corresponding to people that sleep less than 8 hours:
head(filter(NHIS2007, SLEEP < 8))
## HHX FMX FPX SEX BMI SLEEP educ height weight
## 1 20 1 1 1 26.54 7 14 70 185
## 2 69 1 2 2 32.13 7 9 61 170
## 3 101 1 1 1 99.99 6 13 99 172
## 4 122 1 1 1 24.39 7 12 70 170
## 5 129 1 2 2 24.47 7 16 65 147
## 6 134 1 2 2 25.38 7 18 64 148
Your turn: get the rows with BMI between 20 and 30
# Your code here
arrange() sorts the rows by one or more columns in ascending or descending order
For example, to arrange the rows in the order of ascending “BMI”
head(arrange(NHIS2007, BMI))
## HHX FMX FPX SEX BMI SLEEP educ height weight
## 1 37567 1 4 2 12.91 7 16 96 996
## 2 5677 1 1 2 15.56 7 16 96 996
## 3 47560 1 1 2 15.84 6 13 96 996
## 4 1179 1 1 2 15.92 5 13 96 996
## 5 38609 1 1 1 16.04 5 12 96 996
## 6 7331 1 1 1 16.26 99 12 96 996
And in the order of descending “BMI”
head(arrange(NHIS2007, desc(BMI)))
## HHX FMX FPX SEX BMI SLEEP educ height weight
## 1 99 1 1 2 99.99 98 12 98 998
## 2 101 1 1 1 99.99 6 13 99 172
## 3 194 1 1 2 99.99 9 12 99 130
## 4 1104 1 1 2 99.99 99 13 96 996
## 5 1175 1 1 2 99.99 8 12 97 997
## 6 1550 1 2 1 99.99 8 3 98 160
Or sorted first by SLEEP and then by BMI
head(arrange(NHIS2007, SLEEP, BMI))
## HHX FMX FPX SEX BMI SLEEP educ height weight
## 1 21735 1 1 2 18.87 3 10 64 110
## 2 38550 1 1 2 19.75 3 14 67 126
## 3 29955 1 1 1 22.13 3 13 69 150
## 4 5454 1 1 2 22.61 3 14 66 140
## 5 38687 1 1 2 25.03 3 18 66 155
## 6 30509 1 1 1 26.68 3 12 74 208
your turn: by descending weight first and then by height
# Your code here
When you load dplyr, you also load another package called magrittr, which includes a special command called the pipe (%>%). Pipes are useful to simplify code. Pipes connect two commands such that the result of the first command is the input to the first parameter of the second command.
For example, to filter first for BMI lower than 30, select the columns height and education, and then sort by height:
NHIS2007 %>%
filter(BMI < 30) %>%
select(height, educ) %>%
arrange(height) %>% head()
## height educ
## 1 59 16
## 2 59 6
## 3 59 12
## 4 59 16
## 5 59 12
## 6 59 11
Your turn: sort reversed by BMI and then fiter by weight below 150
# Your code here
The commands group_by and summarise are very useful to calcualte aggregated values over groups of rows determined by one or more columns. The result has only the columns used for the group_by and the ones calculated by summarise.
NHIS2007 %>%
group_by(SEX) %>%
summarise(AvgBMI = mean(BMI, na.rm = TRUE)) %>% head()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## SEX AvgBMI
## <int> <dbl>
## 1 1 30.9
## 2 2 32.4
Your turn: calculate the minimum height for each education level
# Your code here
Sometimes you have two data frames and you want to merge them according to the values of a column. The merge function of R can do that, but the inner_join function of dplyr is much, much faster:
df1 <- data.frame(a=c(1,2,3), b=c("one","two","three"))
df2 <- data.frame(a=c(2,3,3), x=c("x","y","z"), z=c(1,1,1))
head(inner_join(df1,df2))
## Joining, by = "a"
## a b x z
## 1 2 two x 1
## 2 3 three y 1
## 3 3 three z 1
inner_join automatically takes the common columns for the merging, but you can also specify which columns to use:
head(inner_join(df1,df2, by=c("a"="z")))
## a b a.y x
## 1 1 one 2 x
## 2 1 one 3 y
## 3 1 one 3 z
For further practice: Take your solution of our FOI and GDP exercise and change the merge commands to inner_join commands.
To learn more: