1. Create a function that adds polynomial features to a data frame.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
polynomial = function(feature, power){
return(data.frame(sapply(1:power, function(x){(feature)**x})))
}
2. For the remainder of the assignment we will be working with the house
Sales data as in the previous notebooks. Load in the data and also sort the
sales SFrame by 'sqft_living'. When we plot the fitted values we want to join
them up in a line and this works best if the variable on the X-axis (which will
be sqft_living
) is sorted. For houses with identical square footage, we break
the tie by their prices.
sales = read.csv("kc_house_data.csv", sep=",", header=T)
sales = sales %>% arrange(sqft_living, price)
3. Make a 1 degree polynomial SFrame with sales['sqft_living']
as the feature.
Call it poly1_data
poly1_data = data.frame(sqft_living=sales$sqft_living)
4. Add sales['price']
to poly1_data
as this will be our output variable.
poly1_data$price = sales$price
5. Compute the regression weights for predicting sales['price']
based on the
1 degree polynomial feature sqft_living
. The result should be an intercept and
slope
model1 = lm(price ~ sqft_living, data = poly1_data)
summary(model1)
##
## Call:
## lm(formula = price ~ sqft_living, data = poly1_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1476062 -147486 -24043 106182 4362067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43580.743 4402.690 -9.899 <2e-16 ***
## sqft_living 280.624 1.936 144.920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 261500 on 21611 degrees of freedom
## Multiple R-squared: 0.4929, Adjusted R-squared: 0.4928
## F-statistic: 2.1e+04 on 1 and 21611 DF, p-value: < 2.2e-16
model_params = coefficients(model1)
6. Next produce a scatter plot of the training data (just square feet vs price) and add the fitted model
require(ggplot2)
## Loading required package: ggplot2
ggplot(poly1_data, aes(x=sqft_living, y=price)) + geom_point(alpha=0.25) +
geom_abline(intercept=model_params[1], slope=model_params[2], color="red") +
theme_bw()
7. Repeat the steps above using 2nd degree and 3rd degree polynomials. Look at the fitted lines, do they appear as you would expect?
poly2_data = polynomial(sales$sqft_living, 2)
poly3_data = polynomial(sales$sqft_living, 3)
poly2_data$price = sales$price
poly3_data$price = sales$price
model2 = lm(price ~ ., data=poly2_data)
model3 = lm(price ~ ., data=poly3_data)
model2_dat = data.frame(sqft_living = sales$sqft_living)
model2_dat$price = model2$fitted.values
model3_dat = data.frame(sqft_living = sales$sqft_living)
model3_dat$price = model3$fitted.values
# 2nd degree polynomial fit
ggplot(sales, aes(x=sqft_living, y=price)) + geom_point(alpha=0.25) +
geom_line(data=model2_dat, color="red") +
geom_line(data=model3_dat, color="blue") +
theme_bw()
8. Now try a 15th degree polynomial. Print out the coefficients and look at the resulted fitted line.
poly15_data = polynomial(sales$sqft_living, 15)
poly15_data$price = sales$price
model15 = lm(price ~ ., data = poly15_data)
summary(model15)
##
## Call:
## lm(formula = price ~ ., data = poly15_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2000929 -139479 -27743 98428 2573431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.866e+06 1.077e+06 -1.732 0.08331 .
## X1 1.457e+04 6.716e+03 2.170 0.03004 *
## X2 -4.265e+01 1.783e+01 -2.392 0.01676 *
## X3 7.013e-02 2.682e-02 2.615 0.00894 **
## X4 -7.213e-05 2.570e-05 -2.806 0.00502 **
## X5 4.950e-08 1.672e-08 2.961 0.00307 **
## X6 -2.360e-11 7.666e-12 -3.079 0.00208 **
## X7 8.027e-15 2.539e-15 3.162 0.00157 **
## X8 -1.977e-18 6.151e-19 -3.215 0.00131 **
## X9 3.552e-22 1.096e-22 3.243 0.00119 **
## X10 -4.645e-26 1.429e-26 -3.251 0.00115 **
## X11 4.363e-30 1.345e-30 3.245 0.00118 **
## X12 -2.865e-34 8.874e-35 -3.228 0.00125 **
## X13 1.246e-38 3.888e-39 3.205 0.00135 **
## X14 -3.222e-43 1.014e-43 -3.179 0.00148 **
## X15 3.746e-48 1.189e-48 3.151 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 245500 on 21597 degrees of freedom
## Multiple R-squared: 0.5531, Adjusted R-squared: 0.5528
## F-statistic: 1782 on 15 and 21597 DF, p-value: < 2.2e-16
model15_dat = data.frame(sqft_living = sales$sqft_living)
model15_dat$price = model15$fitted.values
ggplot(sales, aes(x=sqft_living, y=price)) + geom_point(alpha=0.25) +
geom_line(data=model15_dat, color="red") +
theme_bw()
9. Use 4 different subsets of data and estimate a 15 degree polynomial on all 4 sets, plot the results and vew the coefficients for all four models
set_1 = read.csv("wk3_kc_house_set_1_data.csv", sep=",", header=T)
set_2 = read.csv("wk3_kc_house_set_2_data.csv", sep=",", header=T)
set_3 = read.csv("wk3_kc_house_set_3_data.csv", sep=",", header=T)
set_4 = read.csv("wk3_kc_house_set_4_data.csv", sep=",", header=T)
p1_dat = polynomial(set_1$sqft_living, 15)
p2_dat = polynomial(set_2$sqft_living, 15)
p3_dat = polynomial(set_3$sqft_living, 15)
p4_dat = polynomial(set_4$sqft_living, 15)
p1_dat$price = set_1$price
p2_dat$price = set_2$price
p3_dat$price = set_3$price
p4_dat$price = set_4$price
m1 = lm(price ~ ., data = p1_dat)
m2 = lm(price ~ ., data = p2_dat)
m3 = lm(price ~ ., data = p3_dat)
m4 = lm(price ~ ., data = p4_dat)
summary(m1); summary(m2); summary(m3); summary(m4)
##
## Call:
## lm(formula = price ~ ., data = p1_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1330184 -140294 -30366 98406 2378047
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.270e+06 1.140e+06 -2.868 0.004141 **
## X1 1.716e+04 5.340e+03 3.213 0.001322 **
## X2 -3.514e+01 1.045e+01 -3.363 0.000775 ***
## X3 4.030e-02 1.131e-02 3.562 0.000371 ***
## X4 -2.864e-05 7.581e-06 -3.779 0.000159 ***
## X5 1.331e-08 3.325e-09 4.002 6.36e-05 ***
## X6 -4.150e-12 9.827e-13 -4.223 2.45e-05 ***
## X7 8.762e-16 1.976e-16 4.434 9.42e-06 ***
## X8 -1.242e-19 2.682e-20 -4.632 3.71e-06 ***
## X9 1.146e-23 2.380e-24 4.814 1.52e-06 ***
## X10 -6.359e-28 1.277e-28 -4.978 6.62e-07 ***
## X11 1.703e-32 3.324e-33 5.124 3.09e-07 ***
## X12 NA NA NA NA
## X13 -7.173e-42 1.337e-42 -5.364 8.50e-08 ***
## X14 NA NA NA NA
## X15 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 247800 on 5391 degrees of freedom
## Multiple R-squared: 0.5999, Adjusted R-squared: 0.599
## F-statistic: 673.6 on 12 and 5391 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ ., data = p2_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1535702 -138345 -22176 96245 1898946
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.861e+05 2.472e+06 0.278 0.781
## X1 -2.911e+03 1.552e+04 -0.188 0.851
## X2 8.828e+00 4.122e+01 0.214 0.830
## X3 -1.596e-02 6.160e-02 -0.259 0.796
## X4 1.902e-05 5.812e-05 0.327 0.743
## X5 -1.515e-08 3.676e-08 -0.412 0.680
## X6 8.158e-12 1.611e-11 0.507 0.613
## X7 -3.001e-15 4.969e-15 -0.604 0.546
## X8 7.565e-19 1.081e-18 0.700 0.484
## X9 -1.293e-22 1.638e-22 -0.789 0.430
## X10 1.452e-26 1.667e-26 0.871 0.384
## X11 -9.908e-31 1.051e-30 -0.942 0.346
## X12 3.299e-35 3.288e-35 1.004 0.316
## X13 NA NA NA NA
## X14 -2.226e-44 2.031e-44 -1.096 0.273
## X15 NA NA NA NA
##
## Residual standard error: 234500 on 5384 degrees of freedom
## Multiple R-squared: 0.5417, Adjusted R-squared: 0.5406
## F-statistic: 489.6 on 13 and 5384 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ ., data = p3_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1234825 -143072 -28882 97712 2263872
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.448e+06 2.138e+06 -0.677 0.4982
## X1 1.017e+04 1.177e+04 0.864 0.3876
## X2 -2.656e+01 2.748e+01 -0.967 0.3337
## X3 3.977e-02 3.612e-02 1.101 0.2709
## X4 -3.741e-05 2.994e-05 -1.249 0.2115
## X5 2.332e-08 1.660e-08 1.405 0.1600
## X6 -9.920e-12 6.352e-12 -1.562 0.1184
## X7 2.920e-15 1.706e-15 1.712 0.0870 .
## X8 -5.957e-19 3.219e-19 -1.851 0.0642 .
## X9 8.317e-23 4.212e-23 1.975 0.0484 *
## X10 -7.677e-27 3.689e-27 -2.081 0.0375 *
## X11 4.328e-31 1.996e-31 2.168 0.0302 *
## X12 -1.194e-35 5.337e-36 -2.237 0.0253 *
## X13 NA NA NA NA
## X14 5.563e-45 2.394e-45 2.324 0.0202 *
## X15 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 250800 on 5395 degrees of freedom
## Multiple R-squared: 0.5738, Adjusted R-squared: 0.5727
## F-statistic: 558.7 on 13 and 5395 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ ., data = p4_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2314399 -136911 -28136 101650 2033680
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.292e+06 3.017e+06 -0.428 0.668
## X1 1.151e+04 2.336e+04 0.493 0.622
## X2 -3.632e+01 7.569e+01 -0.480 0.631
## X3 6.463e-02 1.377e-01 0.469 0.639
## X4 -7.259e-05 1.589e-04 -0.457 0.648
## X5 5.519e-08 1.239e-07 0.445 0.656
## X6 -2.963e-11 6.775e-11 -0.437 0.662
## X7 1.152e-14 2.654e-14 0.434 0.664
## X8 -3.270e-18 7.509e-18 -0.436 0.663
## X9 6.756e-22 1.530e-21 0.442 0.659
## X10 -9.959e-26 2.206e-25 -0.451 0.652
## X11 1.005e-29 2.169e-29 0.463 0.643
## X12 -6.379e-34 1.338e-33 -0.477 0.634
## X13 2.029e-38 4.135e-38 0.491 0.624
## X14 NA NA NA NA
## X15 -1.327e-47 2.568e-47 -0.517 0.605
##
## Residual standard error: 244500 on 5387 degrees of freedom
## Multiple R-squared: 0.4989, Adjusted R-squared: 0.4976
## F-statistic: 383.1 on 14 and 5387 DF, p-value: < 2.2e-16
set_1$group = 1; set_2$group = 2; set_3$group = 3; set_4$group = 4
set_1$pred = m1$fitted.values; set_2$pred = m2$fitted.values; set_3$pred = m3$fitted.values; set_4$pred = m4$fitted.values
all_dat = rbind(set_1, set_2, set_3, set_4)
# Plot all 4 models
ggplot(all_dat, aes(x=sqft_living, y=price, color=as.factor(group))) +
geom_point(alpha=0.25) +
geom_line(aes(x=sqft_living, y=pred)) +
theme_bw() +
facet_wrap(~ group)
10. QUIZ QUESTION: Is the sign (positive or negative) for power15
the same in all four models?
No
11. QUIZ QUESTION: True/False the plotted fittedliens look the same in all four plots False
12. Since the 'best' polynomial degree is unknown to us we will use cross validation to select the best degree.
test_set = read.csv("wk3_kc_house_test_data.csv", sep=",", header=T)
train_set = read.csv("wk3_kc_house_train_data.csv", sep=",", header=T)
valid_set = read.csv("wk3_kc_house_valid_data.csv", sep=",", header=T)
13. For each degree from 1-15 build a training set using sqft_living
as the feature, train a linear model
on this data and compute the RSS using the validation data
for(i in 1:15){
# Train model using training data
dat = polynomial(train_set$sqft_living, i)
dat$price = train_set$price
model = lm(price ~ ., data=dat)
# Compute RSS using test/validation set
valid_dat = polynomial(valid_set$sqft_living, i)
rss = sum((valid_set$price - predict(model, newdata=valid_dat))**2) / 1e9
print(paste(c("Degree: ", i, " RSS: ", rss), collapse=""))
}
## [1] "Degree: 1 RSS: 629097.886299586"
## [1] "Degree: 2 RSS: 623955.062706516"
## [1] "Degree: 3 RSS: 625820.280251537"
## [1] "Degree: 4 RSS: 629987.337747604"
## [1] "Degree: 5 RSS: 620045.618882205"
## [1] "Degree: 6 RSS: 620118.963163648"
## [1] "Degree: 7 RSS: 986938.97900047"
## [1] "Degree: 8 RSS: 695990.293710956"
## [1] "Degree: 9 RSS: 34150525.2813713"
## [1] "Degree: 10 RSS: 726976.563935924"
## [1] "Degree: 11 RSS: 10973217323.1916"
## [1] "Degree: 12 RSS: 920547759841.858"
## [1] "Degree: 13 RSS: 98992401633101.4"
## [1] "Degree: 14 RSS: 1241808577677994"
## Warning in predict.lm(model, newdata = valid_dat): prediction from a rank-
## deficient fit may be misleading
## [1] "Degree: 15 RSS: 1241808577677994"
14. QUIZ QUESTION: Which degree (1,2..15) had the lowest RSS on Validation data?
Degree 5
15. Now that you have selected a degree compute the RSS on TEST data for the model with the best degree from the Validation data
dat = polynomial(train_set$sqft_living, 5)
dat$price = train_set$price
m5 = lm(price ~ ., data=dat)
rss = sum((test_set$price - predict(m5, newdata=polynomial(test_set$sqft_living, 5)))**2)
rss
## [1] 1.355672e+14
16. QUIZ QUESTION: What is the RSS on TEST data for the model with the degree selected from Validation data?
rss
## [1] 1.355672e+14