Libraries

Libraries, or R packages, are loaded into the R environment using library() or require()

require(ISLR)
## Loading required package: ISLR
require(MASS)
## Loading required package: MASS
require(GGally)
## Loading required package: GGally
require(ggplot2)
## Loading required package: ggplot2

Simple linear regression

In this example we will explore the Boston dataset contained within the MASS package. We will seek to predict medv (median house value, in $1000s) using rm (average number of rooms per house), age (average age of houses in the neighborhood) and lstat (percentage of households with low socioeconomic status).

# First let's explore the Boston dataset
dim(Boston)
## [1] 506  14
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
#summary(Boston)

# Plot a scatterplot matrix to explore relationships/correlations
ggpairs(Boston[,c("medv", "age", "rm", "lstat")], alpha=0.1)

From the scatterplot matrix we can see there is an inverse relationship between medv and lstat; that is, when the median value of houses in the neighborhood increases, there is a corresponding drop in the percentage of households with low socioeconomic status.

ggplot(Boston, aes(x=medv, y=lstat)) +
  geom_jitter(alpha=0.7) +
  labs(title="lstat vs medv", x="Median value of homes", y="% homes with low socioeconomic status")

From the scatterplot matrix we can also see a positive relationship between medv and rm.

ggplot(Boston, aes(x=medv, y=rm)) +
  geom_jitter(alpha=0.7) +
  labs(title="rm vs medv", x="Median value of homes", y="Avg house room number")

The correlation between medv and age is less apparent; however, there does seem to be a negative relationship. It appears as if the median value of houses decreases slightly with age.

ggplot(Boston, aes(x=medv, y=age)) + 
  geom_jitter(alpha=0.7) + 
  labs(title="age vs medv", x="Median value of homes", y="Average age of house")
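
To put numbers on what the scatterplot matrix shows, we can also compute the pairwise correlations directly. This is a small sketch using only base R's cor():

```r
library(MASS)  # provides the Boston dataset

# Pairwise correlations between the response and the three candidate predictors;
# medv should correlate positively with rm and negatively with age and lstat
round(cor(Boston[, c("medv", "rm", "age", "lstat")]), 2)
```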

Linear regression

Now let's start building the linear regression model. First we'll make a model using a single regressor (feature/covariate) and then we'll build the model up by including more features. To start, let's try regressing medv on lstat and see how that works.

fit <- lm(medv ~ lstat, data=Boston)
summary(fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
# Check to see how the model fits
intercept <- coef(fit)[1]
slope <- coef(fit)[2]
ggplot(Boston, aes(x=lstat, y=medv)) +
  geom_point(position=position_jitter(), alpha=0.6) + 
  geom_abline(intercept=intercept, slope=slope, col="red")

From the summary output we can see that both parameter estimates are significant; that is, both the intercept and the slope have p-values < 0.05. Additionally, the slope parameter is less than 0 and the confidence interval for the slope does not cross 0, suggesting a significant negative relationship between lstat and medv. The slope is -0.95, which suggests that each one-point increase in lstat is associated with a roughly $950 decrease in the median house value.

Checking assumptions

We can check assumptions about our model, such as equal variance, by plotting the residuals using the plot() method. This will also tell us if our data is non-linear: if it is, the residuals will exhibit some sort of pattern when plotted.

par(mfrow=c(2,2))
plot(fit)

On the basis of the residuals there appears to be some non-linearity in our data. Additionally, residuals and studentized residuals can be computed using residuals(fit) and studres(fit) respectively. Leverage statistics can be computed using hatvalues(fit).
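
For example, we can compute these diagnostics directly and locate the highest-leverage observation. A quick sketch using base R and MASS::studres:

```r
library(MASS)  # Boston dataset and studres()

fit <- lm(medv ~ lstat, data=Boston)

# Studentized residuals: values beyond +/-3 suggest possible outliers
stud <- studres(fit)
sum(abs(stud) > 3)

# Leverage statistics: find the observation with the largest leverage
lev <- hatvalues(fit)
which.max(lev)
```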

Making predictions

Once we have a working model we can make predictions using the predict() method. In this example we will predict the median value of houses in a neighborhood given various lstat values for that neighborhood.

# Create a dataframe containing neighborhoods with various lstat values
new_data <- data.frame(lstat=c(8, 9, 10, 15, 20, 30))

# Return predictions with confidence intervals 
predict(fit, new_data, interval="confidence")
##        fit       lwr       upr
## 1 26.95345 26.305287 27.601605
## 2 26.00340 25.393471 26.613322
## 3 25.05335 24.474132 25.632563
## 4 20.30310 19.731588 20.874613
## 5 15.55285 14.773550 16.332157
## 6  6.05236  4.625004  7.479716
# Return predictions with prediction intervals
predict(fit, new_data, interval="prediction")
##        fit       lwr      upr
## 1 26.95345 14.724265 39.18263
## 2 26.00340 13.776182 38.23061
## 3 25.05335 12.827626 37.27907
## 4 20.30310  8.077742 32.52846
## 5 15.55285  3.316021 27.78969
## 6  6.05236 -6.242765 18.34749

You can see that the predicted medv decreases by about $950 for every one-unit increase in lstat.
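
As a sanity check, we can also verify that the prediction intervals are wider than the confidence intervals, since they account for the irreducible error in individual observations as well as the uncertainty in the fitted line. A small sketch:

```r
library(MASS)  # Boston dataset

fit <- lm(medv ~ lstat, data=Boston)
new_data <- data.frame(lstat=c(8, 9, 10, 15, 20, 30))

conf <- predict(fit, new_data, interval="confidence")
pred <- predict(fit, new_data, interval="prediction")

# Compare interval widths: the prediction intervals should always be wider
data.frame(lstat=new_data$lstat,
           conf_width=conf[, "upr"] - conf[, "lwr"],
           pred_width=pred[, "upr"] - pred[, "lwr"])
```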

Multiple linear regression

In the above example we created a model to predict medv using only a single variable, lstat. However, we can build more complex models by including more regressors. For example, we could include age and rm as well as lstat in our model, and then determine which variables significantly influence (i.e. predict) medv.

fit <- lm(medv ~ lstat + age, data=Boston)
summary(fit)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

We can also create a model with all 13 predictors in the Boston dataset.

# We can use . to tell the lm function to use all variables
fit <- lm(medv ~ . , data=Boston)
summary(fit)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
fit_summary <- summary(fit)
fit_summary$r.sq
## [1] 0.7406427
fit_summary$sigma
## [1] 4.745298

The car package can be used to compute variance inflation factors (VIFs).

library(car)
vif(fit)
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491
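
rad and tax have the highest VIFs (above 7 and 9 respectively), suggesting collinearity between them. One common response is to drop one of the collinear variables and refit; a sketch (the usual VIF cutoff of 5-10 is a rule of thumb, not a hard rule):

```r
library(MASS)  # Boston dataset
library(car)   # vif()

# Drop tax, which has the highest VIF, and recheck the remaining VIFs;
# rad's VIF should fall noticeably once its collinear partner is removed
fit_reduced <- lm(medv ~ . - tax, data=Boston)
vif(fit_reduced)
```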

We can also create a model using all but one of the variables, or update an existing model to add a variable back in.

# Create a new model with all variables except age
fit <- lm(medv ~ . -age, data=Boston)

# Update a model to include age
fit <- update(fit, ~ . + age, data=Boston)
summary(fit)
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
##     rad + tax + ptratio + black + lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Interaction terms

We can include interaction effects between variables by using age:rm, which includes only the interaction term in the model. However, if we use age*rm, the notation will automatically also include the individual variables in the model.

fit <- lm(medv ~ age:rm, data=Boston)
fit2 <- lm(medv ~ age*rm, data=Boston)
summary(fit)
## 
## Call:
## lm(formula = medv ~ age:rm, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.335  -5.097  -1.917   2.595  31.089 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.818720   1.058235  25.343  < 2e-16 ***
## age:rm      -0.010056   0.002297  -4.378 1.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.036 on 504 degrees of freedom
## Multiple R-squared:  0.03663,    Adjusted R-squared:  0.03472 
## F-statistic: 19.16 on 1 and 504 DF,  p-value: 1.459e-05
summary(fit2)
## 
## Call:
## lm(formula = medv ~ age * rm, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.864  -3.059  -0.182   2.191  38.604 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -59.59286    7.78687  -7.653 1.01e-13 ***
## age           0.38153    0.09673   3.944 9.15e-05 ***
## rm           13.74747    1.20186  11.438  < 2e-16 ***
## age:rm       -0.07141    0.01512  -4.722 3.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.186 on 502 degrees of freedom
## Multiple R-squared:  0.5503, Adjusted R-squared:  0.5476 
## F-statistic: 204.7 on 3 and 502 DF,  p-value: < 2.2e-16

Non-linear transformations on variables

We've seen that there seems to be a non-linear relationship between medv and lstat. We can fit a non-linear model by performing some sort of transformation on the variables, for example by adding a quadratic term.

quad_fit <- lm(medv ~ lstat + I(lstat^2), data=Boston)
linear_fit <- lm(medv ~ lstat, data=Boston)
predictions <- data.frame(lstat=1:30)
predictions$pred <- predict(quad_fit, predictions)
ggplot(Boston, aes(x=lstat, y=medv)) + geom_point(alpha=0.5) + 
  geom_line(data=predictions, aes(x=lstat, y=pred), color="red") + 
  geom_abline(intercept=intercept, slope=slope, color="blue")

The plot above shows the two models: the linear model (blue) and the quadratic model (red). The quadratic model appears to fit the data better; however, we can test this more formally using anova().

# Test the null hypothesis that the two models fit the data equally well
anova(quad_fit, linear_fit)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat + I(lstat^2)
## Model 2: medv ~ lstat
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    503 15347                                 
## 2    504 19472 -1   -4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let's inspect the residuals and leverage statistics for the new fit.

par(mfrow=c(2,2))
plot(quad_fit)
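
Other transformations besides polynomials can also be used; for example, a log transformation of the predictor. A quick sketch comparing fits via R-squared:

```r
library(MASS)  # Boston dataset

# Fit medv against log(lstat) and compare explanatory power
log_fit <- lm(medv ~ log(lstat), data=Boston)
summary(log_fit)$r.squared  # compare with the linear and quadratic fits above
```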

Using the poly() function will include all lower-order polynomial terms up to the specified order.

fit <- lm(medv ~ poly(lstat, 5), data=Boston)
summary(fit)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
## poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
## poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
## poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
## poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
## poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
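
Note that poly() uses orthogonal polynomials by default; poly(lstat, 5, raw=TRUE) would use raw powers instead. The coefficients differ between the two parameterizations, but the fitted values are identical because both span the same space of fifth-degree polynomials. A quick sketch:

```r
library(MASS)  # Boston dataset

fit_orth <- lm(medv ~ poly(lstat, 5), data=Boston)
fit_raw  <- lm(medv ~ poly(lstat, 5, raw=TRUE), data=Boston)

# Different coefficients, but identical fitted values and R-squared
all.equal(fitted(fit_orth), fitted(fit_raw))
```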

Qualitative predictors

Qualitative variables can be used in regression models too. To see how R encodes these variables, we can use the contrasts() function.

data(Carseats)
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1  9.50       138     73          11        276   120       Bad  42
## 2 11.22       111     48          16        260    83      Good  65
## 3 10.06       113     35          10        269    80    Medium  59
## 4  7.40       117    100           4        466    97    Medium  55
## 5  4.15       141     64           3        340   128       Bad  38
## 6 10.81       124    113          13        501    72       Bad  78
##   Education Urban  US
## 1        17   Yes Yes
## 2        10   Yes Yes
## 3        12   Yes Yes
## 4        14   Yes Yes
## 5        13   Yes  No
## 6        16    No Yes
contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
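
R encodes ShelveLoc with two dummy variables, using Bad as the baseline level. If a different baseline is preferred, the factor can be releveled; a sketch:

```r
library(ISLR)  # Carseats dataset

# Make Medium the baseline level instead of Bad
shelve <- relevel(Carseats$ShelveLoc, ref="Medium")
contrasts(shelve)  # now Bad and Good are the dummy columns
```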

Now let's try to predict carseat sales (Sales) using the features of Carseats. We will include interaction terms for Income and Advertising, as well as Price and Age.

fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data=Carseats)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16