Date Tags R

This is my note on swirl course Regression Model : Multivariable Examples 3.

We'll use the hunger dataset from the course:

download.file("https://raw.githubusercontent.com/swirldev/swirl_courses/master/Regression_Models/MultiVar_Examples3/hunger.csv", 
    "hunger.csv", method = "curl", quiet = T)
hunger <- read.csv("hunger.csv")
head(hunger)
##    X                              Indicator Data.Source PUBLISH.STATES Year WHO.region  Country
## 1  8 Children aged <5 years underweight (%) NLIS_310044      Published 1986     Africa  Senegal
## 2 11 Children aged <5 years underweight (%) NLIS_310095      Published 1989     Africa   Uganda
## 3 13 Children aged <5 years underweight (%) NLIS_310138      Published 1988     Africa Zimbabwe
## 4 16 Children aged <5 years underweight (%) NLIS_310044      Published 1986     Africa  Senegal
## 5 18 Children aged <5 years underweight (%) NLIS_310095      Published 1989     Africa   Uganda
## 6 21 Children aged <5 years underweight (%) NLIS_310138      Published 1988     Africa Zimbabwe
##      Sex Display.Value Numeric Low High Comments
## 1   Male          19.3    19.3  NA   NA       NA
## 2 Female          19.1    19.1  NA   NA       NA
## 3 Female           7.2     7.2  NA   NA       NA
## 4 Female          15.3    15.3  NA   NA       NA
## 5   Male          20.4    20.4  NA   NA       NA
## 6   Male           8.7     8.7  NA   NA       NA

The Numeric column gives the percentage of children under age 5 who were underweight when that sample was taken.
We fit a simple linear regression for Numeric (outcome) on Year:

fit <- lm(Numeric ~ Year, hunger)
summary(fit)
## 
## Call:
## lm(formula = Numeric ~ Year, data = hunger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.621 -11.196  -1.994   7.085  45.039 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 634.47966  121.14460   5.237 2.01e-07 ***
## Year         -0.30840    0.06053  -5.095 4.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.23 on 946 degrees of freedom
## Multiple R-squared:  0.02671,    Adjusted R-squared:  0.02568 
## F-statistic: 25.96 on 1 and 946 DF,  p-value: 4.209e-07

As time goes on, the rate of hunger decreases.
The intercept of the model represents the percentage of hungry children at year 0.

Let's look at the rates of hunger for the different genders to see how, or even if, they differ:

lmF <- lm(Numeric[hunger$Sex == "Female"] ~ Year[hunger$Sex == "Female"], data = hunger)
summary(lmF)
## 
## Call:
## lm(formula = Numeric[hunger$Sex == "Female"] ~ Year[hunger$Sex == 
##     "Female"], data = hunger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.228 -10.638  -1.959   6.859  46.146 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  603.50580  167.53201   3.602 0.000349 ***
## Year[hunger$Sex == "Female"]  -0.29340    0.08371  -3.505 0.000500 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.94 on 472 degrees of freedom
## Multiple R-squared:  0.02537,    Adjusted R-squared:  0.0233 
## F-statistic: 12.29 on 1 and 472 DF,  p-value: 5e-04
lmM <- lm(Numeric[hunger$Sex == "Male"] ~ Year[hunger$Sex == "Male"], data = hunger)
summary(lmM)
## 
## Call:
## lm(formula = Numeric[hunger$Sex == "Male"] ~ Year[hunger$Sex == 
##     "Male"], data = hunger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.913 -11.741  -1.832   7.399  42.255 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                665.45352  174.50726   3.813 0.000155 ***
## Year[hunger$Sex == "Male"]  -0.32340    0.08719  -3.709 0.000233 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.48 on 472 degrees of freedom
## Multiple R-squared:  0.02832,    Adjusted R-squared:  0.02626 
## F-statistic: 13.76 on 1 and 472 DF,  p-value: 0.0002328

We plot the data points and fitted lines using different colors to distinguish between males (blue) and females (pink).

plot(hunger$Year, hunger$Numeric, type = "n")
points(hunger$Year, hunger$Numeric, pch = 19, col = ((hunger$Sex == "Female") * 1 + 125))
lines(hunger$Year[hunger$Sex == "Male"], lmM$fitted, col = "blue", lwd = 3)
lines(hunger$Year[hunger$Sex == "Female"], lmF$fitted, col = "red", lwd = 3)

We can see from the plot that the lines are not exactly parallel. On the right side of the graph (around the year 2010) they are closer together than on the left side (around 1970). Slopes are -0.29340 for femals, -0.32340 for males.

Now instead of separating the data by subsetting the samples by gender, we'll use gender as another predictor to create the linear model:

lmBoth <- lm(Numeric ~ Year + Sex, data = hunger)
summary(lmBoth)
## 
## Call:
## lm(formula = Numeric ~ Year + Sex, data = hunger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.472 -11.297  -1.848   7.058  45.990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 633.5283   120.8950   5.240 1.98e-07 ***
## Year         -0.3084     0.0604  -5.106 3.99e-07 ***
## SexMale       1.9027     0.8576   2.219   0.0267 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.2 on 945 degrees of freedom
## Multiple R-squared:  0.03175,    Adjusted R-squared:  0.0297 
## F-statistic: 15.49 on 2 and 945 DF,  p-value: 2.392e-07

Notice that Male and Female are factors (factors are treated in alphabetic order, so reference, here, is the Female group).
The intercept represents the percentage of hungry females at year 0.
The estimate for the factor Male 1.9027 is a distance from the intercept (the estimate of the reference group Female). So the percentage of hungry males at year 0 is the sum of the intercept and the male estimate, ie. 633.5283 + 1.9027 = 635.431
The estimate for hunger$Year represents the annual decrease in percentage of both gender.

In the plot, the red line will have the female intercept and the blue line will have the male intercept:

plot(hunger$Year, hunger$Numeric, pch = 19)
points(hunger$Year, hunger$Numeric, pch = 19, col = ((hunger$Sex == "Female") * 1 + 125))
abline(c(lmBoth$coeff[1], lmBoth$coeff[2]), col = "red", lwd = 3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3], lmBoth$coeff[2]), col = "blue", lwd = 3)

The lines are parallels (since, they have the same slope lmBoth$coeff[2]).

Now we'll consider the interaction between year and gender to see how that affects changes in rates of hunger:

lmInter <- lm(Numeric ~ Year + Sex + Sex * Year, data = hunger)
summary(lmInter)
## 
## Call:
## lm(formula = Numeric ~ Year + Sex + Sex * Year, data = hunger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.913 -11.248  -1.853   7.087  46.146 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  603.50580  171.05519   3.528 0.000439 ***
## Year          -0.29340    0.08547  -3.433 0.000623 ***
## SexMale       61.94772  241.90858   0.256 0.797946    
## Year:SexMale  -0.03000    0.12087  -0.248 0.804022    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.21 on 944 degrees of freedom
## Multiple R-squared:  0.03181,    Adjusted R-squared:  0.02874 
## F-statistic: 10.34 on 3 and 944 DF,  p-value: 1.064e-06

The percentage of hungry females at year 0 is 603.5058.
The percentage of hungry males at year 0 is 665.4535 (603.50580 + 61.94772).
The annual change (decrease) in percentage of hungry females is 0.29340.
The estimate associated with Year:SexMale represents the distance of the annual change in percent of males from that of females. The annual change in percentage of hungry males is -0.32340 (-0.29340 - 0.03000).

plot(hunger$Year, hunger$Numeric, pch = 19)
points(hunger$Year, hunger$Numeric, pch = 19, col = ((hunger$Sex == "Male") * 1 + 125))
abline(c(lmInter$coeff[1], lmInter$coeff[2]), col = "red", lwd = 3)
abline(c(lmInter$coeff[1] + lmInter$coeff[3], lmInter$coeff[2] + lmInter$coeff[4]), col = "blue", lwd = 3)

The lines are not parallel and will eventually intersect. The Male blue line indicates a faster rate of change.

Here we're dealing with an interaction between factors.

Suppose we have two interacting predictors and one of them is held constant.
The expected change in the outcome for a unit change in the other predictor is the coefficient of that changing predictor + the coefficient of the interaction * the value of the predictor held constant.

For example, let's the model be : \(H_i = b_0 + (b_1*I_i) + (b_2*Y_i)+ (b_3*I_i*Y_i)\) with H the outcomes, I's and Y's the predictors and b's the estimated coefficients of the predictors.
If I is fixed at a value, 5 and Y varies, \(b_2 + b3*5\) represents the change in H per unit change in Y given that I is fixed at 5.