
Linear Regression

David Garcia

ETH Zurich

Social Data Science

1 / 12

Linear Regression

Regression models formalize an equation in which a numeric variable Y is expressed as a linear function of other variables X1, X2, X3, etc.:

Y = a + b1 X1 + b2 X2 + b3 X3 + ... + ϵ
  • Y is called the dependent variable

  • X1, X2, X3, etc are called independent variables

  • a is the intercept, which measures the expected value of Y when all independent variables are zero

  • b1, b2, b3, etc are called the slopes or the coefficients

  • ϵ are the residuals, the errors of the equation in the data
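As a minimal sketch of what this equation means, we can simulate data that follows it for one independent variable (the intercept a = 1, slope b1 = 3, and noise level are made-up values for illustration):

```r
# Simulate data following Y = a + b1*X1 + epsilon
# (a = 1, b1 = 3, and sd = 2 are made-up illustration values)
set.seed(42)
X1 <- runif(100, 0, 10)        # independent variable
epsilon <- rnorm(100, sd = 2)  # residuals
Y <- 1 + 3 * X1 + epsilon      # dependent variable
head(data.frame(X1, Y))
```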

2 / 12

3 / 12

Example: FOI vs GDP

4 / 12

Regression residuals

Residuals ( ϵ ) are the differences between the observed values Yi and their fitted values Ŷi: ϵi = Yi − Ŷi.

5 / 12

Ordinary Least Squares (OLS)

Fitting a regression model is the task of finding the values of the coefficients ( a, b1, b2, etc. ) that minimize some aggregate measure of the residuals of the model. One such measure is the Residual Sum of Squares (RSS), which aggregates residuals as:

RSS = Σi ( Ŷi − Yi )²

The Ordinary Least Squares (OLS) method looks for the values of the coefficients that minimize the RSS. You can thus think of the OLS result as the line that minimizes the sum of squared lengths of the vertical lines in the figure above.
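The minimization can be sketched by brute force: on simulated data (made-up values), evaluate the RSS over a grid of candidate slopes and check that the minimum lands near the slope that lm() finds. For simplicity the intercept is held fixed at its true value, so this only searches over the slope:

```r
# Evaluate the RSS over a grid of candidate slopes on simulated data
# (true intercept 2 and slope 5 are made-up values; the intercept is
# held fixed at 2 so only the slope is searched)
set.seed(1)
x <- runif(50)
y <- 2 + 5 * x + rnorm(50, sd = 0.5)
rss <- function(b) sum((y - (2 + b * x))^2)
slopes <- seq(3, 7, by = 0.01)
best_slope <- slopes[which.min(sapply(slopes, rss))]
c(grid = best_slope, ols = unname(coef(lm(y ~ x))[2]))
```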
6 / 12

Regression in R

The lm() function in R fits a linear regression model with OLS. You have to specify the formula of your regression model. For the case of one independent variable, a formula reads like this:

DependentVariable ~ IndependentVariable

If you print the result of lm(), you will see the best fitting values for the coefficients (intercept and slope):

model <- lm(GDP~FOI, df)
model$coefficients
## (Intercept) FOI
## -4309.223 54631.170
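For a single independent variable, the OLS fit that lm() computes also has a closed form: b = cov(X, Y) / var(X) and a = mean(Y) − b · mean(X). A sketch on simulated data (the df data frame from the slides is not loaded here; true coefficients −1 and 2 are made up):

```r
# Closed-form OLS for one predictor: b = cov(x, y) / var(x)
set.seed(7)
x <- rnorm(200)
y <- -1 + 2 * x + rnorm(200)
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)
all.equal(unname(coef(lm(y ~ x))), c(a, b))  # TRUE
```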
7 / 12

Goodness of fit

One way to measure the quality of a model fit is to calculate the proportion of the variance of the dependent variable ( V[Y] ) that is explained by the model. We can do this by comparing the variance of the residuals ( V[ϵ] ) to the variance of Y.

This is captured by the coefficient of determination, also known as R²:

R² = 1 − V[ϵ] / V[Y]

For our model example:
1-var(residuals(model))/var(df$GDP)
## [1] 0.4432583
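On simulated data (made-up coefficients), this manual computation matches the R² that summary() reports, since with an intercept the residuals have mean exactly zero:

```r
# Check that 1 - var(residuals)/var(y) equals summary()$r.squared
set.seed(3)
x <- runif(100)
y <- x + rnorm(100, sd = 0.3)
m <- lm(y ~ x)
r2_manual <- 1 - var(residuals(m)) / var(y)
all.equal(r2_manual, summary(m)$r.squared)  # TRUE
```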
8 / 12

Multiple regression

You can specify models with more than one independent variable by using "+":

DependentVariable ~ IndependentVariable1 + IndependentVariable2 + IndependentVariable3

If we wanted to fit a model of GDP as a linear combination of the FOI and the internet penetration in countries, we can do it as follows:

model2 <- lm(GDP~FOI+IT.NET.USER.ZS, df)
model2$coefficients
## (Intercept) FOI IT.NET.USER.ZS
## -16154.2983 20528.8273 539.8481
summary(model2)$r.squared
## [1] 0.8140538
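The same pattern can be sketched on simulated data (variable names made up; df, FOI, and IT.NET.USER.ZS are not loaded here), which also illustrates that R² never decreases when a predictor is added to a nested OLS model:

```r
# Adding an informative second predictor raises R^2
set.seed(9)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- x1 + 0.5 * x2 + rnorm(100)
r2_one <- summary(lm(y ~ x1))$r.squared
r2_two <- summary(lm(y ~ x1 + x2))$r.squared
r2_two >= r2_one  # TRUE for nested OLS models
```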
9 / 12

Regression diagnostics

hist(residuals(model), main="", cex.lab=1.5, cex.axis=2)

10 / 12

Regression diagnostics

plot(predict(model), residuals(model), cex.lab=1.5, cex.axis=2)
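If the linear model is appropriate, this plot should show no clear pattern: residuals scattered around zero at every level of the fitted values. A self-contained sketch on simulated data (the model from the slides is not loaded here; coefficients are made up):

```r
# Residuals-vs-fitted diagnostic on a simulated model
set.seed(5)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
m <- lm(y ~ x)
plot(predict(m), residuals(m), cex.lab = 1.5, cex.axis = 2)
abline(h = 0, lty = 2)  # reference line at zero
```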

11 / 12
12 / 12
