Libraries, or R packages, are loaded into the R environment using library() or require().
require(ISLR)
## Loading required package: ISLR
require(MASS)
## Loading required package: MASS
require(GGally)
## Loading required package: GGally
require(ggplot2)
## Loading required package: ggplot2
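The difference between the two: library() throws an error if a package is not installed, while require() returns FALSE (with a warning), which makes it handy for conditional loading. A minimal sketch of that pattern:
# require() returns FALSE if the package is missing, so it can guard an install
if (!require(ISLR)) {
  install.packages("ISLR")
  library(ISLR)
}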
In this example we will explore the Boston dataset contained within the MASS package. We will seek to predict medv (median house value, in $1000s) using rm (average number of rooms per house), age (average age of houses in the neighborhood) and lstat (percentage of households with low socioeconomic status).
# First let's explore the Boston dataset
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
#summary(Boston)
# Plot a scatterplot matrix to explore relationships/correlations
ggpairs(Boston[, c("medv", "age", "rm", "lstat")], lower=list(continuous=wrap("points", alpha=0.1)))
From the scatterplot matrix we can see there is an inverse relationship between medv and lstat; that is, as the median value of houses in a neighborhood increases, the percentage of households with low socioeconomic status drops.
ggplot(Boston, aes(x=medv, y=lstat)) +
geom_jitter(alpha=0.7) +
labs(title="lstat vs medv", x="Median value of homes", y="% homes with low socioeconomic status")
From the scatterplot matrix we can also see the positive relationship between medv and rm.
ggplot(Boston, aes(x=medv, y=rm)) +
geom_jitter(alpha=0.7) +
labs(title="rm vs medv", x="Median value of homes", y="Avg house room number")
The correlation between medv and age is less apparent; however, there does seem to be a negative relationship. It appears as if the median value of the house decreases slightly with age.
ggplot(Boston, aes(x=medv, y=age)) +
geom_jitter(alpha=0.7) +
labs(title="age vs medv", x="Median value of homes", y="Average age of house")
Now let's start building the linear regression model. First we'll make a model using a single regressor (feature/covariate) and then we'll build the model up by including more features. To start, let's try regressing medv on lstat and see how that works.
fit <- lm(medv ~ lstat, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# Check to see how the model fits
intercept <- coef(fit)[1]
slope <- coef(fit)[2]
ggplot(Boston, aes(x=lstat, y=medv)) +
geom_point(position=position_jitter(), alpha=0.6) +
geom_abline(intercept=intercept, slope=slope, col="red")
From the summary output we can see that both parameter estimates are significant, that is, both the intercept and the slope have p-values < 0.05. Additionally, the slope parameter is less than 0 and its confidence interval does not cross 0, suggesting a significant negative relationship between lstat and medv. The slope is -0.95, which suggests that each one-point increase in lstat is associated with a roughly $950 decrease in the median house value.
We can check assumptions about our model, such as equal variance, by plotting the residuals with the plot() function. This will also tell us if our data is non-linear: if the relationship is non-linear, the residuals will exhibit some sort of pattern when plotted.
par(mfrow=c(2,2))
plot(fit)
On the basis of the residuals there appears to be some non-linearity in our data. Additionally, raw residuals and studentized residuals can be obtained using residuals(fit) and studres(fit) respectively, and leverage statistics can be computed using hatvalues(fit).
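For instance, a quick sketch of both diagnostics on the current fit (studres() comes from the already-loaded MASS package):
# Plot studentized residuals against fitted values; a pattern suggests non-linearity
plot(predict(fit), studres(fit), xlab="Fitted values", ylab="Studentized residuals")
# Identify the observation with the largest leverage
which.max(hatvalues(fit))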
Once we have a working model we can make predictions using the predict() function. In this example we will predict the median value of houses in a neighborhood given the percentage of low-socioeconomic-status households (lstat) in that neighborhood.
# Create a dataframe of neighborhoods with various lstat values
new_data <- data.frame(lstat=c(8, 9, 10, 15, 20, 30))
# Return predictions with confidence intervals
predict(fit, new_data, interval="confidence")
## fit lwr upr
## 1 26.95345 26.305287 27.601605
## 2 26.00340 25.393471 26.613322
## 3 25.05335 24.474132 25.632563
## 4 20.30310 19.731588 20.874613
## 5 15.55285 14.773550 16.332157
## 6 6.05236 4.625004 7.479716
# Return predictions with prediction intervals
predict(fit, new_data, interval="prediction")
## fit lwr upr
## 1 26.95345 14.724265 39.18263
## 2 26.00340 13.776182 38.23061
## 3 25.05335 12.827626 37.27907
## 4 20.30310 8.077742 32.52846
## 5 15.55285 3.316021 27.78969
## 6 6.05236 -6.242765 18.34749
You can see that the predicted medv decreases by about $950 for every one-unit increase in lstat. Note also that the prediction intervals are substantially wider than the confidence intervals, since they account for the uncertainty in an individual response rather than just the average response.
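As a sanity check, the first fitted value can be reproduced directly from the coefficients; this should match the 26.95345 shown in the first row above.
# Manual prediction for lstat = 8: intercept + slope * 8
sum(coef(fit) * c(1, 8))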
In the above example we created a model to predict medv using only a single variable, lstat; however, we can build more complex models by including more regressors. For example, we could include age and lstat as well as rm in our model. We could then determine which variables significantly influence (i.e. predict) medv.
fit <- lm(medv ~ lstat + age, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
We can create a model with all 13 predictors in the Boston dataset.
# We can use . to tell the lm function to use all variables
fit <- lm(medv ~ . , data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
fit_summary <- summary(fit)
fit_summary$r.sq
## [1] 0.7406427
fit_summary$sigma
## [1] 4.745298
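Other components stored in the summary object can be listed with names():
# List everything available in the summary object
names(fit_summary)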
The car package can be used to compute variance inflation factors.
library(car)
vif(fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
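A common rule of thumb is that a VIF above 5 (or, more leniently, 10) signals problematic collinearity; here rad and tax stand out. A quick way to flag such predictors:
# Flag predictors whose VIF exceeds 5
vifs <- vif(fit)
vifs[vifs > 5]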
We can also create a model using all but one of the variables.
# Create a new model with all variables except age
fit <- lm(medv ~ . -age, data=Boston)
# Update a model to include age
fit <- update(fit, ~ . + age, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can include interaction effects between variables by using age:rm, which includes only the interaction term in the model. If we instead use age*rm, the notation will automatically include the individual variables in the model as well.
fit <- lm(medv ~ age:rm, data=Boston)
fit2 <- lm(medv ~ age*rm, data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ age:rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.335 -5.097 -1.917 2.595 31.089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.818720 1.058235 25.343 < 2e-16 ***
## age:rm -0.010056 0.002297 -4.378 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.036 on 504 degrees of freedom
## Multiple R-squared: 0.03663, Adjusted R-squared: 0.03472
## F-statistic: 19.16 on 1 and 504 DF, p-value: 1.459e-05
summary(fit2)
##
## Call:
## lm(formula = medv ~ age * rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.864 -3.059 -0.182 2.191 38.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -59.59286 7.78687 -7.653 1.01e-13 ***
## age 0.38153 0.09673 3.944 9.15e-05 ***
## rm 13.74747 1.20186 11.438 < 2e-16 ***
## age:rm -0.07141 0.01512 -4.722 3.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.186 on 502 degrees of freedom
## Multiple R-squared: 0.5503, Adjusted R-squared: 0.5476
## F-statistic: 204.7 on 3 and 502 DF, p-value: < 2.2e-16
We've seen that there seems to be a non-linear relationship between medv and lstat. We can fit a non-linear model by performing some sort of transformation on the variables, for example by adding a quadratic term.
quad_fit <- lm(medv ~ lstat + I(lstat^2), data=Boston)
linear_fit <- lm(medv ~ lstat, data=Boston)
# Predict over a grid of lstat values to draw the quadratic curve
predictions <- data.frame(lstat=1:30)
predictions$pred <- predict(quad_fit, predictions)
ggplot(Boston, aes(x=lstat, y=medv)) + geom_point(alpha=0.5) +
geom_line(data=predictions, aes(x=lstat, y=pred), color="red") +
geom_abline(intercept=intercept, slope=slope, color="blue")
The plot above shows the two different models: the linear model (blue) and the quadratic model (red). It appears that the non-linear model fits the data better; however, we can test this more formally using anova().
# Test the null hypothesis that the two models fit the data equally well
anova(quad_fit, linear_fit)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat + I(lstat^2)
## Model 2: medv ~ lstat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 503 15347
## 2 504 19472 -1 -4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The near-zero p-value for the F-statistic indicates that the quadratic model is a significant improvement over the linear one. Let's inspect the residuals and leverage statistics for the new fit.
par(mfrow=c(2,2))
plot(quad_fit)
Using the poly()
function will include all lower-order polynomial terms up to the specified order.
fit <- lm(medv ~ poly(lstat, 5), data=Boston)
summary(fit)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
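Note that poly() uses orthogonal polynomials by default, which changes the individual coefficient estimates but not the fitted values. Raw polynomial terms can be requested with raw=TRUE; for example, the following is equivalent to the earlier medv ~ lstat + I(lstat^2) fit.
# Same fitted values as lm(medv ~ lstat + I(lstat^2), data=Boston)
raw_fit <- lm(medv ~ poly(lstat, 2, raw=TRUE), data=Boston)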
Qualitative variables can be used in regression models too. To see how R encodes these variables we can use the contrasts() function.
data(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1 9.50 138 73 11 276 120 Bad 42
## 2 11.22 111 48 16 260 83 Good 65
## 3 10.06 113 35 10 269 80 Medium 59
## 4 7.40 117 100 4 466 97 Medium 55
## 5 4.15 141 64 3 340 128 Bad 38
## 6 10.81 124 113 13 501 72 Bad 78
## Education Urban US
## 1 17 Yes Yes
## 2 10 Yes Yes
## 3 12 Yes Yes
## 4 14 Yes Yes
## 5 13 Yes No
## 6 16 No Yes
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
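R has created ShelveLocGood and ShelveLocMedium dummy variables, using Bad (the first level alphabetically) as the baseline. If a different baseline is preferred, it can be set with relevel():
# Re-encode ShelveLoc with Medium as the baseline level
contrasts(relevel(Carseats$ShelveLoc, ref="Medium"))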
Now let's try to predict carseat sales (Sales) using the features of Carseats. We will include interaction terms for Income and Advertising as well as Price and Age.
fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data=Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16