Week 3 Polynomial Regression - R

1. Create a function that adds polynomial features to a data frame.

require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

polynomial = function(feature, power){
  return(data.frame(sapply(1:power, function(x){(feature)**x})))
}

2. For the remainder of the assignment we will be working with the house Sales data as in the previous notebooks. Load in the data and also sort the sales SFrame by 'sqft_living'. When we plot the fitted values we want to join them up in a line and this works best if the variable on the X-axis (which will be sqft_living) is sorted. For houses with identical square footage, we break the tie by their prices.

sales = read.csv("kc_house_data.csv", sep=",", header=T)
sales = sales %>% arrange(sqft_living, price)

3. Make a 1 degree polynomial SFrame with sales['sqft_living'] as the feature. Call it poly1_data

poly1_data = data.frame(sqft_living=sales$sqft_living)

4. Add sales['price'] to poly1_data as this will be our output variable.

poly1_data$price = sales$price

5. Compute the regression weights for predicting sales['price'] based on the 1 degree polynomial feature sqft_living. The result should be an intercept and slope

model1 = lm(price ~ sqft_living, data = poly1_data)
summary(model1)

## 
## Call:
## lm(formula = price ~ sqft_living, data = poly1_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1476062  -147486   -24043   106182  4362067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43580.743   4402.690  -9.899   <2e-16 ***
## sqft_living    280.624      1.936 144.920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261500 on 21611 degrees of freedom
## Multiple R-squared:  0.4929, Adjusted R-squared:  0.4928 
## F-statistic: 2.1e+04 on 1 and 21611 DF,  p-value: < 2.2e-16

model_params = coefficients(model1)

6. Next produce a scatter plot of the training data (just square feet vs price) and add the fitted model

require(ggplot2)

## Loading required package: ggplot2

ggplot(poly1_data, aes(x=sqft_living, y=price)) + geom_point(alpha=0.25) +
  geom_abline(intercept=model_params[1], slope=model_params[2], color="red") +
  theme_bw()

plot of chunk unnamed-chunk-6

7. Repeat the steps above using 2nd degree and 3rd degree polynomials. Look at the fitted lines, do they appear as you would expect?

poly2_data = polynomial(sales$sqft_living, 2)
poly3_data = polynomial(sales$sqft_living, 3)
poly2_data$price = sales$price
poly3_data$price = sales$price

model2 = lm(price ~ ., data=poly2_data)
model3 = lm(price ~ ., data=poly3_data)

model2_dat = data.frame(sqft_living = sales$sqft_living)
model2_dat$price = model2$fitted.values

model3_dat = data.frame(sqft_living = sales$sqft_living)
model3_dat$price = model3$fitted.values

# 2nd degree polynomial fit
ggplot(sales, aes(x=sqft_living, y=price)) + geom_point(alpha=0.25) +
  geom_line(data=model2_dat, color="red") +
  geom_line(data=model3_dat, color="blue") +
  theme_bw()

plot of chunk unnamed-chunk-7

8. Now try a 15th degree polynomial. Print out the coefficients and look at the resulted fitted line.

poly15_data = polynomial(sales$sqft_living, 15)
poly15_data$price = sales$price

model15 = lm(price ~ ., data = poly15_data)
summary(model15)

## 
## Call:
## lm(formula = price ~ ., data = poly15_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2000929  -139479   -27743    98428  2573431 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.866e+06  1.077e+06  -1.732  0.08331 . 
## X1           1.457e+04  6.716e+03   2.170  0.03004 * 
## X2          -4.265e+01  1.783e+01  -2.392  0.01676 * 
## X3           7.013e-02  2.682e-02   2.615  0.00894 **
## X4          -7.213e-05  2.570e-05  -2.806  0.00502 **
## X5           4.950e-08  1.672e-08   2.961  0.00307 **
## X6          -2.360e-11  7.666e-12  -3.079  0.00208 **
## X7           8.027e-15  2.539e-15   3.162  0.00157 **
## X8          -1.977e-18  6.151e-19  -3.215  0.00131 **
## X9           3.552e-22  1.096e-22   3.243  0.00119 **
## X10         -4.645e-26  1.429e-26  -3.251  0.00115 **
## X11          4.363e-30  1.345e-30   3.245  0.00118 **
## X12         -2.865e-34  8.874e-35  -3.228  0.00125 **
## X13          1.246e-38  3.888e-39   3.205  0.00135 **
## X14         -3.222e-43  1.014e-43  -3.179  0.00148 **
## X15          3.746e-48  1.189e-48   3.151  0.00163 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 245500 on 21597 degrees of freedom
## Multiple R-squared:  0.5531, Adjusted R-squared:  0.5528 
## F-statistic:  1782 on 15 and 21597 DF,  p-value: < 2.2e-16

model15_dat = data.frame(sqft_living = sales$sqft_living)
model15_dat$price = model15$fitted.values

ggplot(sales, aes(x=sqft_living, y=price)) + geom_point(alpha=0.25) +
  geom_line(data=model15_dat, color="red") +
  theme_bw()

plot of chunk unnamed-chunk-8

9. Use 4 different subsets of data and estimate a 15 degree polynomial on all 4 sets, plot the results and vew the coefficients for all four models

set_1 = read.csv("wk3_kc_house_set_1_data.csv", sep=",", header=T)
set_2 = read.csv("wk3_kc_house_set_2_data.csv", sep=",", header=T)
set_3 = read.csv("wk3_kc_house_set_3_data.csv", sep=",", header=T)
set_4 = read.csv("wk3_kc_house_set_4_data.csv", sep=",", header=T)

p1_dat = polynomial(set_1$sqft_living, 15)
p2_dat = polynomial(set_2$sqft_living, 15)
p3_dat = polynomial(set_3$sqft_living, 15)
p4_dat = polynomial(set_4$sqft_living, 15)

p1_dat$price = set_1$price
p2_dat$price = set_2$price
p3_dat$price = set_3$price
p4_dat$price = set_4$price

m1 = lm(price ~ ., data = p1_dat)
m2 = lm(price ~ ., data = p2_dat)
m3 = lm(price ~ ., data = p3_dat)
m4 = lm(price ~ ., data = p4_dat)

summary(m1); summary(m2); summary(m3); summary(m4)

## 
## Call:
## lm(formula = price ~ ., data = p1_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1330184  -140294   -30366    98406  2378047 
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.270e+06  1.140e+06  -2.868 0.004141 ** 
## X1           1.716e+04  5.340e+03   3.213 0.001322 ** 
## X2          -3.514e+01  1.045e+01  -3.363 0.000775 ***
## X3           4.030e-02  1.131e-02   3.562 0.000371 ***
## X4          -2.864e-05  7.581e-06  -3.779 0.000159 ***
## X5           1.331e-08  3.325e-09   4.002 6.36e-05 ***
## X6          -4.150e-12  9.827e-13  -4.223 2.45e-05 ***
## X7           8.762e-16  1.976e-16   4.434 9.42e-06 ***
## X8          -1.242e-19  2.682e-20  -4.632 3.71e-06 ***
## X9           1.146e-23  2.380e-24   4.814 1.52e-06 ***
## X10         -6.359e-28  1.277e-28  -4.978 6.62e-07 ***
## X11          1.703e-32  3.324e-33   5.124 3.09e-07 ***
## X12                 NA         NA      NA       NA    
## X13         -7.173e-42  1.337e-42  -5.364 8.50e-08 ***
## X14                 NA         NA      NA       NA    
## X15                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 247800 on 5391 degrees of freedom
## Multiple R-squared:  0.5999, Adjusted R-squared:  0.599 
## F-statistic: 673.6 on 12 and 5391 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = price ~ ., data = p2_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1535702  -138345   -22176    96245  1898946 
## 
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.861e+05  2.472e+06   0.278    0.781
## X1          -2.911e+03  1.552e+04  -0.188    0.851
## X2           8.828e+00  4.122e+01   0.214    0.830
## X3          -1.596e-02  6.160e-02  -0.259    0.796
## X4           1.902e-05  5.812e-05   0.327    0.743
## X5          -1.515e-08  3.676e-08  -0.412    0.680
## X6           8.158e-12  1.611e-11   0.507    0.613
## X7          -3.001e-15  4.969e-15  -0.604    0.546
## X8           7.565e-19  1.081e-18   0.700    0.484
## X9          -1.293e-22  1.638e-22  -0.789    0.430
## X10          1.452e-26  1.667e-26   0.871    0.384
## X11         -9.908e-31  1.051e-30  -0.942    0.346
## X12          3.299e-35  3.288e-35   1.004    0.316
## X13                 NA         NA      NA       NA
## X14         -2.226e-44  2.031e-44  -1.096    0.273
## X15                 NA         NA      NA       NA
## 
## Residual standard error: 234500 on 5384 degrees of freedom
## Multiple R-squared:  0.5417, Adjusted R-squared:  0.5406 
## F-statistic: 489.6 on 13 and 5384 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = price ~ ., data = p3_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1234825  -143072   -28882    97712  2263872 
## 
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.448e+06  2.138e+06  -0.677   0.4982  
## X1           1.017e+04  1.177e+04   0.864   0.3876  
## X2          -2.656e+01  2.748e+01  -0.967   0.3337  
## X3           3.977e-02  3.612e-02   1.101   0.2709  
## X4          -3.741e-05  2.994e-05  -1.249   0.2115  
## X5           2.332e-08  1.660e-08   1.405   0.1600  
## X6          -9.920e-12  6.352e-12  -1.562   0.1184  
## X7           2.920e-15  1.706e-15   1.712   0.0870 .
## X8          -5.957e-19  3.219e-19  -1.851   0.0642 .
## X9           8.317e-23  4.212e-23   1.975   0.0484 *
## X10         -7.677e-27  3.689e-27  -2.081   0.0375 *
## X11          4.328e-31  1.996e-31   2.168   0.0302 *
## X12         -1.194e-35  5.337e-36  -2.237   0.0253 *
## X13                 NA         NA      NA       NA  
## X14          5.563e-45  2.394e-45   2.324   0.0202 *
## X15                 NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 250800 on 5395 degrees of freedom
## Multiple R-squared:  0.5738, Adjusted R-squared:  0.5727 
## F-statistic: 558.7 on 13 and 5395 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = price ~ ., data = p4_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2314399  -136911   -28136   101650  2033680 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.292e+06  3.017e+06  -0.428    0.668
## X1           1.151e+04  2.336e+04   0.493    0.622
## X2          -3.632e+01  7.569e+01  -0.480    0.631
## X3           6.463e-02  1.377e-01   0.469    0.639
## X4          -7.259e-05  1.589e-04  -0.457    0.648
## X5           5.519e-08  1.239e-07   0.445    0.656
## X6          -2.963e-11  6.775e-11  -0.437    0.662
## X7           1.152e-14  2.654e-14   0.434    0.664
## X8          -3.270e-18  7.509e-18  -0.436    0.663
## X9           6.756e-22  1.530e-21   0.442    0.659
## X10         -9.959e-26  2.206e-25  -0.451    0.652
## X11          1.005e-29  2.169e-29   0.463    0.643
## X12         -6.379e-34  1.338e-33  -0.477    0.634
## X13          2.029e-38  4.135e-38   0.491    0.624
## X14                 NA         NA      NA       NA
## X15         -1.327e-47  2.568e-47  -0.517    0.605
## 
## Residual standard error: 244500 on 5387 degrees of freedom
## Multiple R-squared:  0.4989, Adjusted R-squared:  0.4976 
## F-statistic: 383.1 on 14 and 5387 DF,  p-value: < 2.2e-16

set_1$group = 1; set_2$group = 2; set_3$group = 3; set_4$group = 4
set_1$pred = m1$fitted.values; set_2$pred = m2$fitted.values; set_3$pred = m3$fitted.values; set_4$pred = m4$fitted.values

all_dat = rbind(set_1, set_2, set_3, set_4)

# Plot all 4 models
ggplot(all_dat, aes(x=sqft_living, y=price, color=as.factor(group))) +
  geom_point(alpha=0.25) + 
  geom_line(aes(x=sqft_living, y=pred)) +
  theme_bw() +
  facet_wrap(~ group)

plot of chunk unnamed-chunk-9

10. QUIZ QUESTION: Is the sign (positive or negative) for power15 the same in all four models?

11. QUIZ QUESTION: True/False the plotted fittedliens look the same in all four plots False

12. Since the 'best' polynomial degree is unknown to us we will use cross validation to select the best degree.

test_set = read.csv("wk3_kc_house_test_data.csv", sep=",", header=T)
train_set = read.csv("wk3_kc_house_train_data.csv", sep=",", header=T)
valid_set = read.csv("wk3_kc_house_valid_data.csv", sep=",", header=T)

13. For each degree from 1-15 build a training set using sqft_living as the feature, train a linear model on this data and compute the RSS using the validation data

for(i in 1:15){
  # Train model using training data
  dat = polynomial(train_set$sqft_living, i)
  dat$price = train_set$price
  model = lm(price ~ ., data=dat)

  # Compute RSS using test/validation set
  valid_dat = polynomial(valid_set$sqft_living, i)
  rss = sum((valid_set$price - predict(model, newdata=valid_dat))**2) / 1e9
  print(paste(c("Degree: ", i, " RSS: ", rss), collapse=""))
}

## [1] "Degree: 1 RSS: 629097.886299586"
## [1] "Degree: 2 RSS: 623955.062706516"
## [1] "Degree: 3 RSS: 625820.280251537"
## [1] "Degree: 4 RSS: 629987.337747604"
## [1] "Degree: 5 RSS: 620045.618882205"
## [1] "Degree: 6 RSS: 620118.963163648"
## [1] "Degree: 7 RSS: 986938.97900047"
## [1] "Degree: 8 RSS: 695990.293710956"
## [1] "Degree: 9 RSS: 34150525.2813713"
## [1] "Degree: 10 RSS: 726976.563935924"
## [1] "Degree: 11 RSS: 10973217323.1916"
## [1] "Degree: 12 RSS: 920547759841.858"
## [1] "Degree: 13 RSS: 98992401633101.4"
## [1] "Degree: 14 RSS: 1241808577677994"

## Warning in predict.lm(model, newdata = valid_dat): prediction from a rank-
## deficient fit may be misleading

## [1] "Degree: 15 RSS: 1241808577677994"

14. QUIZ QUESTION: Which degree (1,2..15) had the lowest RSS on Validation data?

Degree 5

15. Now that you have selected a degree compute the RSS on TEST data for the model with the best degree from the Validation data

dat = polynomial(train_set$sqft_living, 5)
dat$price = train_set$price
m5 = lm(price ~ ., data=dat)

rss = sum((test_set$price - predict(m5, newdata=polynomial(test_set$sqft_living, 5)))**2)
rss

## [1] 1.355672e+14

16. QUIZ QUESTION: What is the RSS on TEST data for the model with the degree selected from Validation data?

rss

## [1] 1.355672e+14