Welcome to the R tutorial of Social Data Science! Exercise sessions and self-study tasks will help you to apply the knowledge learned in class and gain experience.

Introduction to R

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. In this course we will use R in the RStudio environment for our exercises. Exercise sheets and solutions will be R Markdown documents that combine and format text, code, and code output.

Follow the instructions from the links provided below to download and install R and RStudio on your computer:

R https://www.r-project.org
RStudio https://www.rstudio.com
R Markdown http://rmarkdown.rstudio.com/

Once you have successfully installed R and RStudio, you can follow the basic commands below to get familiar with R. If all else fails, you can run this crash course in RStudio Cloud: https://rstudio.cloud/project/852344

Remember that there is a great community using R and you can always search online for the way to do things.

Crash course overview:
- Simple operations
- Control flow
- Sampling and histograms
- Reading data
- Data frames
- Plotting and summary statistics

0. Getting help

If you don't know what a function does, try searching for help with "?"

? mean

If you don't know how to do something, your fellow programmer is your friend. For example:

https://stats.stackexchange.com/questions/157661/how-to-calculate-mean-median-mode-std-dev-from-distribution

This is a really useful search engine (Google plus filter for R-relevant content): https://rseek.org/

1. Simple operations

You can use either "<-" or "=" to assign a value to a variable.

a <- 5
b = c(2,4,6,8)
d <- c(3,5,7,9)

The "c" in the above vector assignment stands for "combine" into a vector. The elements of a vector are indexed from 1 to n.

You can see the value of a variable by typing its name in the console.

a
## [1] 5
b
## [1] 2 4 6 8
d[3]
## [1] 7

Now we can do some arithmetic with the vectors.

a*b
## [1] 10 20 30 40
b+d
## [1]  5  9 13 17
b+1
## [1] 3 5 7 9

Notice that when a vector is multiplied by a scalar, each element is multiplied. When vectors are added or multiplied, the arithmetic happens elementwise, so they should have the same length (otherwise R recycles the shorter one).
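A small sketch of what happens when the lengths differ, using made-up vectors (the names `long` and `short` are illustrative only): R reuses the shorter vector instead of stopping with an error, which is easy to overlook.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# Elementwise multiplication of two vectors of equal length
long * long
## [1]  1  4  9 16

# The shorter vector is recycled as 10, 20, 10, 20
long + short
## [1] 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still computes a result but issues a warning.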

Boolean values can also be stored and manipulated in R:

b1 <- TRUE
b2 <- 1>2
b1 & b2
## [1] FALSE
b1 | b2
## [1] TRUE

rep, seq, and rev are useful functions to produce and manipulate simple vectors:

rep(1,7)
## [1] 1 1 1 1 1 1 1
seq(1,7,by=2)
## [1] 1 3 5 7
rev(seq(1,7))
## [1] 7 6 5 4 3 2 1

More examples:

We create a vector V with the even numbers between 1 and 10. Show its content.

V <- seq(from = 2, to = 10, by = 2)
V
## [1]  2  4  6  8 10

We look at its third entry and test if it is larger than 3.

V[3] > 3
## [1] TRUE

Reverse its order and divide each of its entries by 2.

rev(V/2)
## [1] 5 4 3 2 1

Your turn:

Calculate the remainder of 10 divided by 7 (i.e. 10 modulo 7). (Hint: search in Google or Stack Overflow)

# Your code here

2. Control flow

R is a programming language after all, so how do we check for conditions or go through iterations?

if/else statements allow you to check for conditions:

x <- 5
if (x>4)
{
  print("larger than four")
} 
## [1] "larger than four"
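The example above only uses the if branch. As a minimal sketch of the else branch (the variable `msg` is an illustrative name, not part of the course material):

```r
x <- 3
if (x > 4) {
  msg <- "larger than four"
} else {
  # Runs when the condition is FALSE
  msg <- "four or smaller"
}
print(msg)
## [1] "four or smaller"
```

When typing at the console, keep `else` on the same line as the closing brace of the if block; otherwise R treats the if statement as finished before it sees the else.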

for loops are fixed length iterations:

sequence <- seq(1,5)
for (i in sequence)
{
  print(i+1)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

while loops let you iterate as long as a condition is met:

i <- 1
while (i<5)
{
  print(i) # without the increment below, this would be an infinite loop
  i <- i +1
}

Note: explicit loops tend to be slow in R. We will learn faster, vectorized methods for large datasets later in the course.
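As a preview, here is a hedged sketch of the same computation written once as a loop and once as a single vectorized expression. The variable names are illustrative, and the speed difference only becomes noticeable for much longer vectors.

```r
values <- seq(1, 5)

# Loop version: add 1 to every element, one at a time
loop_result <- numeric(length(values))
for (i in seq_along(values)) {
  loop_result[i] <- values[i] + 1
}

# Vectorized version: the same computation in a single expression
vec_result <- values + 1

all(loop_result == vec_result)
## [1] TRUE
```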

Your turn: Iterate over the numbers from 1 to 50 and print the ones divisible by 7.

#Your code here

3. Reading data

Data is essential for our tasks. A table stored in a local csv file can be easily imported with read.csv(). You may also try other functions like read.table() and adjust them to different data formats. In this exercise, we use survey results stored in a .csv file:

testDF <- read.csv("TutorialHeights_Test.csv", header = TRUE, sep = ",", quote = "\"",
                           stringsAsFactors = FALSE)
head(testDF, n=10)
##            Timestamp Age Gender Height
## 1  2/3/2017 14:35:29  25   Male  183.0
## 2  2/3/2017 14:40:08  25 Female  170.0
## 3  2/3/2017 15:11:44  24   Male  178.0
## 4  2/3/2017 15:12:00  27 Female  160.0
## 5  2/3/2017 15:12:12  21 Female  166.0
## 6  2/3/2017 15:56:21  30    XXX  172.0
## 7  2/3/2017 16:06:32  22 Female  165.6
## 8  2/3/2017 16:15:53  19   Male  190.0
## 9  2/3/2017 16:25:12  23   male  178.9
## 10 2/3/2017 16:25:29  27   male  169.8
SurveyDF <- testDF

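The read.csv and read.table functions have some important parameters:
- file: the path of the file to read
- header: whether the first line contains the column names
- sep: the character that separates fields (for example "," or "\t")
- quote: the character(s) used to quote text fields
- stringsAsFactors: whether text columns are converted to factors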
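To see how these arguments behave without needing the course files, the sketch below writes a tiny made-up table to a temporary file and reads it back. The data, the semicolon separator, and the name `toyDF` are assumptions chosen for illustration.

```r
# Write a small, made-up table to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Age;Gender;Height",
             "25;Male;183.0",
             "27;Female;160.0"), tmp)

# header = TRUE: the first line holds the column names
# sep = ";": fields are separated by semicolons in this file
toyDF <- read.csv(tmp, header = TRUE, sep = ";", stringsAsFactors = FALSE)
str(toyDF)

# Remove the temporary file again
unlink(tmp)
```

With stringsAsFactors = FALSE the Gender column stays a character vector, which makes later cleaning steps such as changing capitalization simpler.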

Another example: Open the file "TutorialHeightsSurvey.dat" in a plain text editor to see its content. Then load it into R and name it SurveyDF. Print its first 7 lines.

SurveyDF <- read.table(file = "TutorialHeightsSurvey.dat", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
head(SurveyDF,n=7)
##           Timestamp Age Gender Height
## 1 2/3/2017 14:35:29  25   Male  183.0
## 2 2/3/2017 14:40:08  25 Female  170.0
## 3 2/3/2017 15:11:44  24   Male  178.0
## 4 2/3/2017 15:12:00  27 Female  160.0
## 5 2/3/2017 15:12:12  21 Female  166.0
## 6 2/3/2017 15:56:21  30    Man  172.0
## 7 2/3/2017 16:06:32  22 Female  165.6

4. Data frames

Data frames are the general way to store tabular data of mixed types in R. Each row is one observation, with one value in each column.

names(SurveyDF)
## [1] "Timestamp" "Age"       "Gender"    "Height"
head(SurveyDF)
##           Timestamp Age Gender Height
## 1 2/3/2017 14:35:29  25   Male    183
## 2 2/3/2017 14:40:08  25 Female    170
## 3 2/3/2017 15:11:44  24   Male    178
## 4 2/3/2017 15:12:00  27 Female    160
## 5 2/3/2017 15:12:12  21 Female    166
## 6 2/3/2017 15:56:21  30    Man    172

You can access individual rows:

SurveyDF[1:10,]
##            Timestamp Age Gender Height
## 1  2/3/2017 14:35:29  25   Male  183.0
## 2  2/3/2017 14:40:08  25 Female  170.0
## 3  2/3/2017 15:11:44  24   Male  178.0
## 4  2/3/2017 15:12:00  27 Female  160.0
## 5  2/3/2017 15:12:12  21 Female  166.0
## 6  2/3/2017 15:56:21  30    Man  172.0
## 7  2/3/2017 16:06:32  22 Female  165.6
## 8  2/3/2017 16:15:53  19   Male  190.0
## 9  2/3/2017 16:25:12  23   male  178.9
## 10 2/3/2017 16:25:29  27   male  169.8

And individual values by position:

SurveyDF[3,4]
## [1] 178

Columns in data frames are accessed with the "$" operator:

SurveyDF$Height
##  [1] 183.0 170.0 178.0 160.0 166.0 172.0 165.6 190.0 178.9 169.8 177.9 185.0
## [13] 186.0 177.0 165.0 165.0 161.0 167.0 201.0 172.0 182.3 163.0 171.0

You can index entries in the column:

SurveyDF$Height[1:3]
## [1] 183 170 178

You can add a column:

SurveyDF$sequence <- seq_len(nrow(SurveyDF))
head(SurveyDF, n=3)
##           Timestamp Age Gender Height sequence
## 1 2/3/2017 14:35:29  25   Male    183        1
## 2 2/3/2017 14:40:08  25 Female    170        2
## 3 2/3/2017 15:11:44  24   Male    178        3

You can manually produce your own data frame. Use NA (Not Available) to mark missing values.

newrow <- data.frame(Timestamp=NA, Age=31, Gender="Male", Height=185, sequence=0)
print(newrow)
##   Timestamp Age Gender Height sequence
## 1        NA  31   Male    185        0

And add it to the other data frame row-wise:

SurveyDF2 <- rbind(SurveyDF, newrow)
tail(SurveyDF2)
##             Timestamp Age Gender Height sequence
## 19 2/23/2017 15:20:52  12 Female  201.0       19
## 20 2/23/2017 15:52:14  27   Male  172.0       20
## 21 2/23/2017 16:16:31  26   male  182.3       21
## 22 2/23/2017 16:17:12  22 female  163.0       22
## 23 2/23/2017 16:18:40  23 female  171.0       23
## 24               <NA>  31   Male  185.0        0
df <- NULL

Your turn:

Save the first, third, and fifth rows of SurveyDF in another data frame and print its first column.

#Your code here

Another example: Print the heights of the rows in SurveyDF of gender "female".

# change this code for the data frame and columns to fit your code
SurveyDF$Height[SurveyDF$Gender=="female"]
## [1] 167 163 171
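The selection above works because the comparison produces a logical vector that picks out the matching entries. A minimal sketch on a made-up data frame (`toyDF` and its values are hypothetical) shows the intermediate step and how conditions can be combined:

```r
toyDF <- data.frame(Age    = c(25, 27, 21, 30),
                    Gender = c("male", "female", "female", "male"),
                    Height = c(183, 160, 166, 172))

# A logical vector marks the rows that satisfy the condition
is_female <- toyDF$Gender == "female"
is_female
## [1] FALSE  TRUE  TRUE FALSE

# Conditions can be combined with & (and) and | (or)
toyDF$Height[is_female & toyDF$Age > 25]
## [1] 160
```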

5. Plotting and statistics: How tall are we?

We can produce a simple scatterplot of the data using the plot() function

plot(SurveyDF$Age, SurveyDF$Height)

We are aware that height might depend on gender. What genders do we have in the dataset?

unique(SurveyDF$Gender)
## [1] "Male"   "Female" "Man"    "male"   "other"  "female"

Some values are not in our list of expected genders, and the same gender appears with different capitalization. To clean the Gender column:

SurveyDF$Gender <- tolower(SurveyDF$Gender)
err <- SurveyDF$Gender!= "male" & SurveyDF$Gender != "female"
SurveyDF$Gender[err] <- NA
SurveyDF$Gender <- as.factor(SurveyDF$Gender)
unique(SurveyDF$Gender)
## [1] male   female <NA>  
## Levels: female male
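After a cleaning step like this, it can help to count how many entries ended up in each category, including the missing ones. The sketch below repeats the cleaning on a made-up vector (the values in `gender` are hypothetical) and counts the result with table():

```r
# Made-up raw values with mixed capitalization and unexpected entries
gender <- c("Male", "female", "Man", "male", "other", "Female")

# Same cleaning idea as above: lower-case, then mark unexpected values as NA
gender <- tolower(gender)
gender[!(gender %in% c("male", "female"))] <- NA

# useNA = "ifany" adds a count for NA when missing values are present
counts <- table(gender, useNA = "ifany")
counts
```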

Now we make a better scatter plot with points colored by gender and proper axes labels:

plot(SurveyDF$Age, SurveyDF$Height, xlab="Age", 
     ylab="Heights (cm)", main="Scatter Plot of Height~Age", pch=19,
     col=c("blue","red")[SurveyDF$Gender])
legend("bottomright", legend = levels(SurveyDF$Gender),
       col=c("blue","red"), pch=19)

What is the mean height of each gender, and how much do they vary?

MaleDF <- SurveyDF[SurveyDF$Gender=="male",]
mean(MaleDF$Height, na.rm=TRUE)
## [1] 178.7417
sqrt(var(MaleDF$Height, na.rm=TRUE))
## [1] 7.185523
FemaleDF <- SurveyDF[SurveyDF$Gender=="female",]
mean(FemaleDF$Height, na.rm=TRUE)
## [1] 169.8444
sd(FemaleDF$Height, na.rm=TRUE)
## [1] 12.14569
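The two calls above use sqrt(var(...)) and sd(...), which compute the same quantity. As an alternative sketch, tapply() can apply a function to each group in one call. The data frame `toyDF` below is made up for illustration and is not the survey data.

```r
toyDF <- data.frame(Gender = c("male", "female", "male", "female", NA),
                    Height = c(180, 165, 176, 171, 190))

# Mean and standard deviation of Height for each value of Gender;
# rows with a missing Gender are left out of the groups
group_means <- tapply(toyDF$Height, toyDF$Gender, mean, na.rm = TRUE)
group_sds   <- tapply(toyDF$Height, toyDF$Gender, sd, na.rm = TRUE)
group_means
group_sds

# sd() is the square root of var()
all.equal(sd(toyDF$Height), sqrt(var(toyDF$Height)))
## [1] TRUE
```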

Your turn:

Sort the height values and plot them in sequence.

#Your code here

Print the mean and median height. Then print the standard deviation of the age of females (Hint: check ?sd).

#Your code here

6. Sampling and histograms

Now we will learn about the Gaussian distribution and plot it in R. The rnorm function lets you sample values from a normal distribution, and hist shows a histogram of values.

# Set seed for random generator
# (23-2-2017 is evaluated as arithmetic, so the seed is -1996)
set.seed(23-2-2017)
# Generate 100000 random numbers from normal distribution
RandomNum <- rnorm(100000, mean=0, sd=1)
# Calculate and plot histogram
hist(RandomNum)
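Setting the seed makes the random numbers reproducible. A minimal sketch with an arbitrary seed value (42 is an assumption for illustration, not the seed used in this tutorial):

```r
# Drawing twice after setting the same seed gives identical samples
set.seed(42)
first_draw <- rnorm(3)
set.seed(42)
second_draw <- rnorm(3)

identical(first_draw, second_draw)
## [1] TRUE
```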

The dnorm function returns the value of the Gaussian density function at the specified point. Below are a few examples and a plot for a range of values.

dnorm(0)
## [1] 0.3989423
dnorm(1, mean=2, sd=2)
## [1] 0.1760327
x <- seq(-5,5,by=.1)
y <- dnorm(x)
plot(x,y, type="l")
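To connect dnorm with the formula of the Gaussian density, the sketch below writes that formula out by hand and compares it with dnorm. The function name `gauss_density` is made up for this illustration.

```r
# Gaussian density: 1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
gauss_density <- function(x, mean = 0, sd = 1) {
  1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
}

gauss_density(0)
## [1] 0.3989423

# Agrees with dnorm up to floating point precision
all.equal(gauss_density(1, mean = 2, sd = 2), dnorm(1, mean = 2, sd = 2))
## [1] TRUE
```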

Your turn:

Plot the histogram of 1000 values sampled from the uniform distribution between -10 and 10. If you get lost, type ?Distributions or search online.

#Your code here

7. To practice more...

Load the data from the file BMI-steps.csv. Is there a correlation between BMI and steps? Is it different for women and men? Do you notice anything else?

df <- read.csv("BMI-steps.csv")
#Your code here

8. To learn more about R and Markdown