Libraries, or R packages, are loaded into the R environment using library() or require()
require(ISLR)
## Loading required package: ISLR
require(MASS)
## Loading required package: MASS
require(GGally)
## Loading required package: GGally
require(ggplot2)
## Loading required package: ggplot2
In this example we will explore the Boston dataset contained within the MASS package. We will seek to predict medv (median house value, in $1000s) using rm (average number of rooms per house), age (average age of houses in the neighborhood) and lstat (percentage of households with low socioeconomic status).
# First let's explore the Boston dataset
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
#summary(Boston)
# Plot a scatterplot matrix to explore relationships/correlations
ggpairs(Boston[,c("medv", "age", "rm", "lstat")], alpha=0.1)
From the scatterplot matrix we can see there is an inverse relationship between medv and lstat; that is, when the median value of houses in the neighborhood increases, there is a corresponding drop in the percentage of households with low socioeconomic status.
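We can quantify this with the sample correlation (a quick check using base R's cor()):
# Correlation between median home value and % low-status households
cor(Boston$medv, Boston$lstat)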
ggplot(Boston, aes(x=medv, y=lstat)) +
geom_jitter(alpha=0.7) +
labs(title="lstat vs medv", x="Median value of homes", y="% homes with low socioeconomic status")
From the scatterplot matrix we can also see a positive relationship between medv and rm.
ggplot(Boston, aes(x=medv, y=rm)) +
geom_jitter(alpha=0.7) +
labs(title="rm vs medv", x="Median value of homes", y="Avg house room number")
The correlation between medv and age is less apparent; however, there does seem to be a negative relationship: the median value of a house appears to decrease slightly with age.
ggplot(Boston, aes(x=medv, y=age)) +
geom_jitter(alpha=0.7) +
labs(title="age vs medv", x="Median value of homes", y="Average age of house")
Now let's start building the linear regression model. First we'll fit a model using a single regressor (feature/covariate) and then we'll build the model up by including more features. To start, let's try regressing medv on lstat and see how that works.
fit <- lm(medv ~ lstat, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# Check to see how the model fits
intercept <- coef(fit)[1]
slope <- coef(fit)[2]
ggplot(Boston, aes(x=lstat, y=medv)) +
geom_point(position=position_jitter(), alpha=0.6) +
geom_abline(intercept=intercept, slope=slope, col="red")
From the summary output we can see that both parameter estimates are significant; that is, both the intercept and the slope have p-values < 0.05. Additionally, the confidence interval for the slope parameter does not cross 0, suggesting a significant relationship between lstat and medv. The slope is -0.950, which suggests that each one-percentage-point increase in lstat is associated with a roughly $950 decrease in median house value.
We can check assumptions about our model, such as equal variance, by plotting the residuals with the plot() method. This will also tell us if our data is non-linear: if the data is non-linear then the residuals will exhibit some sort of pattern when plotted.
par(mfrow=c(2,2))
plot(fit)
On the basis of the residuals there appears to be some non-linearity in our data. Additionally, residuals and studentized residuals can be computed using residuals(fit) and studres(fit) respectively. Leverage statistics can be computed using hatvalues(fit).
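As a quick sketch of those diagnostic helpers applied to the current fit:
# Raw residuals, studentized residuals (MASS) and leverage statistics
res <- residuals(fit)
sres <- studres(fit)
lev <- hatvalues(fit)
# Identify the observation with the greatest leverage
which.max(lev)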
Once we have a working model we can make predictions using the predict() method. In this example we will predict the median value of houses in a neighborhood given the percentage of low-socioeconomic-status households (lstat) in that neighborhood.
# Create a dataframe of neighborhoods with various lstat values
new_data <- data.frame(lstat=c(8, 9, 10, 15, 20, 30))
# Return predictions with confidence intervals
predict(fit, new_data, interval="confidence")
## fit lwr upr
## 1 26.95345 26.305287 27.601605
## 2 26.00340 25.393471 26.613322
## 3 25.05335 24.474132 25.632563
## 4 20.30310 19.731588 20.874613
## 5 15.55285 14.773550 16.332157
## 6 6.05236 4.625004 7.479716
# Return predictions with prediction intervals
predict(fit, new_data, interval="prediction")
## fit lwr upr
## 1 26.95345 14.724265 39.18263
## 2 26.00340 13.776182 38.23061
## 3 25.05335 12.827626 37.27907
## 4 20.30310 8.077742 32.52846
## 5 15.55285 3.316021 27.78969
## 6 6.05236 -6.242765 18.34749
You can see that the predicted medv decreases by about $950 for every one-unit increase in lstat, matching the slope estimate.
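We can confirm this with a minimal check; predictions one lstat unit apart differ by exactly the slope estimate:
# The difference between fitted values one lstat unit apart is the slope
diff(predict(fit, data.frame(lstat=c(10, 11))))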
In the above example we created a model to predict medv using only a single variable, lstat; however, we can build more complex models by including more regressors. For example, we could include age and rm as well as lstat in our model. We could then determine which variables significantly influence (i.e. predict) medv.
fit <- lm(medv ~ lstat + age, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
We can also create a model with all 13 predictors in the Boston dataset.
# We can use . to tell the lm function to use all variables
fit <- lm(medv ~ . , data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
fit_summary <- summary(fit)
fit_summary$r.sq
## [1] 0.7406427
fit_summary$sigma
## [1] 4.745298
The car package can be used to compute variance inflation factors.
library(car)
vif(fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
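A common rule of thumb treats a VIF above 5 (or 10, depending on the convention) as a sign of problematic collinearity. A minimal sketch to flag such predictors:
# Flag predictors whose variance inflation factor exceeds 5
vifs <- vif(fit)
names(vifs)[vifs > 5]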
We can also fit a model using all but one of the variables, and then update it to add the variable back in.
# Create a new model with all variables except age
fit <- lm(medv ~ . -age, data=Boston)
# Update a model to include age
fit <- update(fit, ~ . + age, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can include interaction effects between variables by using age:rm, which includes only the interaction term in the model. If we instead use age*rm, the notation automatically includes the individual variables age and rm in the model as well.
fit <- lm(medv ~ age:rm, data=Boston)
fit2 <- lm(medv ~ age*rm, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ age:rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.335 -5.097 -1.917 2.595 31.089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.818720 1.058235 25.343 < 2e-16 ***
## age:rm -0.010056 0.002297 -4.378 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.036 on 504 degrees of freedom
## Multiple R-squared: 0.03663, Adjusted R-squared: 0.03472
## F-statistic: 19.16 on 1 and 504 DF, p-value: 1.459e-05
summary(fit2)
##
## Call:
## lm(formula = medv ~ age * rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.864 -3.059 -0.182 2.191 38.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -59.59286 7.78687 -7.653 1.01e-13 ***
## age 0.38153 0.09673 3.944 9.15e-05 ***
## rm 13.74747 1.20186 11.438 < 2e-16 ***
## age:rm -0.07141 0.01512 -4.722 3.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.186 on 502 degrees of freedom
## Multiple R-squared: 0.5503, Adjusted R-squared: 0.5476
## F-statistic: 204.7 on 3 and 502 DF, p-value: < 2.2e-16
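As a quick sanity check, fitting the expanded formula by hand gives coefficients identical to the * shorthand:
# age*rm is shorthand for age + rm + age:rm
fit3 <- lm(medv ~ age + rm + age:rm, data=Boston)
all.equal(coef(fit2), coef(fit3))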
We've seen that there seems to be a non-linear relationship between medv and lstat. We can fit a non-linear model by performing some sort of transformation on the variables.
quad_fit <- lm(medv ~ lstat + I(lstat^2), data=Boston)
linear_fit <- lm(medv ~ lstat, data=Boston)
# Predict medv over a grid of lstat values using the quadratic fit
predictions <- data.frame(lstat=1:30)
predictions$pred <- predict(quad_fit, predictions)
ggplot(Boston, aes(x=lstat, y=medv)) + geom_point(alpha=0.5) +
geom_line(data=predictions, aes(x=lstat, y=pred), color="red") +
geom_abline(intercept=intercept, slope=slope, color="blue")
The plot above shows the two different models: the linear model (blue) and the quadratic model (red). It appears that the non-linear model fits the data better; to test this more formally we can use anova().
# Test the null hypothesis that the two models fit the data equally well
anova(quad_fit, linear_fit)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat + I(lstat^2)
## Model 2: medv ~ lstat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 503 15347
## 2 504 19472 -1 -4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The large F-statistic and its low p-value indicate that the quadratic model fits significantly better than the linear one. Let's inspect the residuals and leverage statistics for the new fit.
par(mfrow=c(2,2))
plot(quad_fit)
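Polynomial terms are not the only option; any transformation of a predictor can be used in the formula. A minimal sketch using a log transformation of rm:
# Regress medv on the log of the average number of rooms
log_fit <- lm(medv ~ log(rm), data=Boston)
summary(log_fit)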
Using the poly() function will include all lower-order polynomial terms up to the specified order.
fit <- lm(medv ~ poly(lstat, 5), data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
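Note that poly() fits orthogonal polynomials by default, so the coefficients above are not the coefficients of the raw powers of lstat. Passing raw=TRUE fits the raw powers instead; the fitted values are identical either way:
# Raw (non-orthogonal) polynomial terms give the same fitted values
fit_raw <- lm(medv ~ poly(lstat, 5, raw=TRUE), data=Boston)
all.equal(fitted(fit), fitted(fit_raw))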
Qualitative variables can be used in regression models too. To see how R encodes these variables we can use the contrasts() function.
data(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## Education Urban US
## 1 17 Yes Yes
## 2 10 Yes Yes
## 3 12 Yes Yes
## 4 14 Yes Yes
## 5 13 Yes No
## 6 16 No Yes
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
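R uses the first factor level (here Bad) as the baseline by default. If a different baseline is preferred, relevel() changes it; a minimal sketch, done on a copy so the original data is left untouched:
# Use Medium rather than Bad as the baseline level
cs <- Carseats
cs$ShelveLoc <- relevel(cs$ShelveLoc, ref="Medium")
contrasts(cs$ShelveLoc)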
Now let's try to predict carseat sales, Sales, using the features of Carseats. We will include interaction terms for Income and Advertising, as well as Price and Age.
fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data=Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
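As with the simple model, predict() works here too. A minimal sketch for a single hypothetical store (every value below is invented purely for illustration):
# A hypothetical store; all characteristics are made up for illustration
new_store <- data.frame(CompPrice=125, Income=70, Advertising=10,
                        Population=300, Price=110, ShelveLoc="Good",
                        Age=50, Education=12, Urban="Yes", US="Yes")
predict(fit, new_store, interval="confidence")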