Welcome to the Social Data Science dplyr tutorial! This will be a quick introduction into one of the most useful tools to handle large datasets by producing code that is easy to read.

You can find this script to run in Rstudio cloud: https://rstudio.cloud/project/892606

Introduction to dplyr

dplyr https://https://dplyr.tidyverse.org/ is a package for data manipulation, which is the process of re-sorting, rearranging, and moving data, without fundamentally changing it. It provides a way to make your data more readable and organized in an efficient way.

dplyr provides “verbs” that you can apply to the data frame that you are working on. Here we are going to see the following:

select columns from a data frame
filter rows that you want in a data frame
arrange a data frame to sort it
group_by when you want to run analyses over groups defined by one column or more
summarise to run analyses over those groups
inner_join to merge data frames very efficiently

0. Installing dplyr

Some versions of RStudio bring dplyr installed by default, but run the following chunk to install it in any case:

install.packages("dplyr")

Once you have it install, load it with the library function.

library(dplyr)

## Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
## when loading 'dplyr'

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

For this tutorial we are going to use the data from the National Health Interview Survey of 2007.

NHIS2007 <- read.csv("NHIS2007.csv")
head(NHIS2007)

##   HHX FMX FPX SEX   BMI SLEEP educ height weight
## 1  16   1   2   1 33.36     8   16     74    260
## 2  20   1   1   1 26.54     7   14     70    185
## 3  69   1   2   2 32.13     7    9     61    170
## 4  87   1   1   1 26.62     8   14     68    175
## 5  88   1   1   2 27.13     8   13     66    168
## 6  99   1   1   2 99.99    98   12     98    998

1. Selecting columns

select() lets you subset by columns. The first parameter of select() is the data frame, the second is a selection of columns. For example to get just the column BMI:

head(select(NHIS2007, BMI))

##     BMI
## 1 33.36
## 2 26.54
## 3 32.13
## 4 26.62
## 5 27.13
## 6 99.99

But you can also select subsets of colums like the first three:

head(select(NHIS2007, 1:3))

##   HHX FMX FPX
## 1  16   1   2
## 2  20   1   1
## 3  69   1   2
## 4  87   1   1
## 5  88   1   1
## 6  99   1   1

or a range:

head(select(NHIS2007, SEX:educ))

##   SEX   BMI SLEEP educ
## 1   1 33.36     8   16
## 2   1 26.54     7   14
## 3   2 32.13     7    9
## 4   1 26.62     8   14
## 5   2 27.13     8   13
## 6   2 99.99    98   12

Your turn: select the first and last columns.

# Your code here

2. Filtering rows

filter() allows you to subset rows that fulfill a condition. The second parameter is a logical statement to take only rows for which it is TRUE.

For example, to get the rows corresponding to people that sleep less than 8 hours:

head(filter(NHIS2007, SLEEP < 8))

##   HHX FMX FPX SEX   BMI SLEEP educ height weight
## 1  20   1   1   1 26.54     7   14     70    185
## 2  69   1   2   2 32.13     7    9     61    170
## 3 101   1   1   1 99.99     6   13     99    172
## 4 122   1   1   1 24.39     7   12     70    170
## 5 129   1   2   2 24.47     7   16     65    147
## 6 134   1   2   2 25.38     7   18     64    148

Your turn: get the rows with BMI between 20 and 30

# Your code here

3. Arranging to sort

arrange() sorts the rows by one or more columns in ascending or descending order

For example, to arrange the rows in the order of ascending “BMI”

head(arrange(NHIS2007, BMI))

##     HHX FMX FPX SEX   BMI SLEEP educ height weight
## 1 37567   1   4   2 12.91     7   16     96    996
## 2  5677   1   1   2 15.56     7   16     96    996
## 3 47560   1   1   2 15.84     6   13     96    996
## 4  1179   1   1   2 15.92     5   13     96    996
## 5 38609   1   1   1 16.04     5   12     96    996
## 6  7331   1   1   1 16.26    99   12     96    996

And in the order of descending “BMI”

head(arrange(NHIS2007, desc(BMI)))

##    HHX FMX FPX SEX   BMI SLEEP educ height weight
## 1   99   1   1   2 99.99    98   12     98    998
## 2  101   1   1   1 99.99     6   13     99    172
## 3  194   1   1   2 99.99     9   12     99    130
## 4 1104   1   1   2 99.99    99   13     96    996
## 5 1175   1   1   2 99.99     8   12     97    997
## 6 1550   1   2   1 99.99     8    3     98    160

Or sorted first by SLEEP and then by BMI

head(arrange(NHIS2007, SLEEP, BMI))

##     HHX FMX FPX SEX   BMI SLEEP educ height weight
## 1 21735   1   1   2 18.87     3   10     64    110
## 2 38550   1   1   2 19.75     3   14     67    126
## 3 29955   1   1   1 22.13     3   13     69    150
## 4  5454   1   1   2 22.61     3   14     66    140
## 5 38687   1   1   2 25.03     3   18     66    155
## 6 30509   1   1   1 26.68     3   12     74    208

your turn: by descending weight first and then by height

# Your code here

4. Pipes

When you load dplyr, you also load another package called magrittr, which includes a special command called the pipe (%>%). Pipes are useful to simplify code. Pipes connect two commands such that the result of the first command is the input to the first parameter of the second command.

For example, to filter first for BMI lower than 30, select the columns height and education, and then sort by height:

NHIS2007 %>%
  filter(BMI < 30) %>%
    select(height, educ) %>%
      arrange(height) %>% head()

##   height educ
## 1     59   16
## 2     59    6
## 3     59   12
## 4     59   16
## 5     59   12
## 6     59   11

Your turn: sort reversed by BMI and then fiter by weight below 150

# Your code here

5. Grouping and summarising

The commands group_by and summarise are very useful to calcualte aggregated values over groups of rows determined by one or more columns. The result has only the columns used for the group_by and the ones calculated by summarise.

NHIS2007 %>%
  group_by(SEX) %>%
    summarise(AvgBMI = mean(BMI, na.rm = TRUE)) %>% head()

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##     SEX AvgBMI
##   <int>  <dbl>
## 1     1   30.9
## 2     2   32.4

Your turn: calculate the minimum height for each education level

# Your code here

6. Merging with inner join

Sometimes you have two data frames and you want to merge them according to the values of a column. The merge function of R can do that, but the inner_join function of dplyr is much, much faster:

df1 <- data.frame(a=c(1,2,3), b=c("one","two","three"))
df2 <- data.frame(a=c(2,3,3), x=c("x","y","z"), z=c(1,1,1))
head(inner_join(df1,df2))

## Joining, by = "a"

##   a     b x z
## 1 2   two x 1
## 2 3 three y 1
## 3 3 three z 1

inner_join automatically takes the common columns for the merging, but you can also specify which columns to use:

head(inner_join(df1,df2, by=c("a"="z")))

##   a   b a.y x
## 1 1 one   2 x
## 2 1 one   3 y
## 3 1 one   3 z

For further practice: Take your solution of our FOI and GDP exercise and change the merge commands to inner_join commands.

To learn more:

Tutorial on dplyr by Sean Anderson http://seananderson.ca/2014/09/13/dplyr-intro.html
Data Camp interactive tutorial on dplyr https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial

Social Data Science: Data wrangling with dplyr

Dr. David Garcia