Predicting House Prices (One Feature)
In this notebook we will use data on house sales in King County, where Seattle is located, to predict house prices using simple (One Feature) linear regression. We will:
1. Load data
train_data = read.csv("kc_house_train_data.csv", header=TRUE, sep=",")
test_data = read.csv("kc_house_test_data.csv", header=TRUE, sep=",")
2. Write a generic function that accepts a column of data (e.g., a vector) 'input_features' and another column 'output' and returns the simple linear regression parameters 'intercept' and 'slope'. Use the closed-form solution from lecture to calculate the slope and intercept.
simple_linear_regression = function(input_features, output){
  # Sums needed by the closed-form least-squares solution
  sum_y = sum(output)
  sum_x = sum(input_features)
  sum_yx = sum(output*input_features)
  sum_xx = sum(input_features^2)
  n = length(output)
  # slope = (sum(xy) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
  slope = (sum_yx - ((sum_y*sum_x)/n))/(sum_xx - ((sum_x^2)/n))
  # intercept = mean(y) - slope*mean(x)
  intercept = (sum_y/n) - slope*(sum_x/n)
  return(list(slope=slope, intercept=intercept))
}
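One way to gain confidence in the closed-form solution is to compare it with base R's `lm()` on a small example. The sketch below is illustrative only: the toy data (`toy_x`, `toy_y`) are made up, and the function is repeated in compact form so the snippet runs on its own.

```r
# Closed-form fit, repeated from step 2 so this snippet runs on its own.
simple_linear_regression = function(input_features, output){
  n = length(output)
  sum_x = sum(input_features)
  sum_y = sum(output)
  slope = (sum(output*input_features) - sum_y*sum_x/n) /
          (sum(input_features^2) - sum_x^2/n)
  intercept = sum_y/n - slope*sum_x/n
  return(list(slope=slope, intercept=intercept))
}

# Toy data (made up for illustration): points lying exactly on y = 1 + 2x.
toy_x = c(1, 2, 3, 4, 5)
toy_y = 1 + 2*toy_x

closed_form = simple_linear_regression(toy_x, toy_y)
reference = coef(lm(toy_y ~ toy_x))   # base R least-squares fit

# Both comparisons should be TRUE up to floating-point tolerance.
slope_matches = isTRUE(all.equal(closed_form$slope, unname(reference[2])))
intercept_matches = isTRUE(all.equal(closed_form$intercept, unname(reference[1])))
c(slope_matches, intercept_matches)
```

If the two fits agree on toy data, the same comparison can be repeated on the training columns as a sanity check.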
3. Use the function to calculate the estimated slope and intercept on the training data to predict price given sqft_living.
sqft = simple_linear_regression(train_data$sqft_living, train_data$price)
sqft
## $slope
## [1] 281.9588
##
## $intercept
## [1] -47116.08
4. Write a function that accepts a column of data 'input_features', the slope, and the intercept you learned, and returns a column of predictions 'predicted_output' for each entry in the input column.
get_regression_predictions = function(input_features, intercept, slope){
  return(intercept + slope*input_features)
}
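Because R arithmetic is vectorized, the same function handles a single value or a whole column. The sketch below assumes the coefficients printed in step 3 (rounded as shown there); the `sizes` vector is hypothetical.

```r
# Prediction function, repeated so this snippet runs on its own.
get_regression_predictions = function(input_features, intercept, slope){
  return(intercept + slope*input_features)
}

# Coefficients copied from the step 3 output (rounded as printed there).
intercept = -47116.08
slope = 281.9588

sizes = c(1000, 2000, 3000)   # hypothetical living areas in sqft
predictions = get_regression_predictions(sizes, intercept, slope)
predictions

# Each extra square foot adds `slope` dollars, so consecutive
# predictions here differ by 1000*slope.
diff(predictions)
```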
5. QUIZ QUESTION: Using the slope and intercept from (3), what is the predicted price for a house with 2,650 sqft?
get_regression_predictions(2650, sqft$intercept, sqft$slope)
## [1] 700074.8
6. Write a function that accepts two columns of data, 'input_features' and 'output', and the regression parameters 'slope' and 'intercept', and returns the Residual Sum of Squares (RSS).
get_residual_sum_of_squares = function(input_features, output, intercept, slope){
  y_hat = intercept + slope*input_features
  return(sum((output - y_hat)^2))
}
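A tiny hand-checkable case can make the RSS definition concrete. The data below are made up for illustration: the line y = 0 + 1*x passes through the first two points and misses the third by 1.

```r
# RSS function, repeated so this snippet runs on its own.
get_residual_sum_of_squares = function(input_features, output, intercept, slope){
  y_hat = intercept + slope*input_features
  return(sum((output - y_hat)^2))
}

# Toy data (made up for illustration).
toy_x = c(1, 2, 4)
toy_y = c(1, 2, 3)

# Predictions are 1, 2, 4, so the residuals are 0, 0, -1.
rss_toy = get_residual_sum_of_squares(toy_x, toy_y, intercept = 0, slope = 1)
rss_toy
```

Since the residuals are squared before summing, RSS is never negative, and it is 0 only when every point lies exactly on the line.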
7. QUIZ QUESTION: According to this function and the slope and intercept from (3), what is the RSS for the simple linear regression using sqft to predict prices on the TRAINING data?
get_residual_sum_of_squares(train_data$sqft_living, train_data$price, sqft$intercept, sqft$slope)
## [1] 1.201918e+15
8. Write a function that accepts a column of data 'output' and the regression parameters 'slope' and 'intercept' and outputs the column of data 'estimated_input'.
inverse_regression_predictions = function(output, intercept, slope){
  return((output - intercept)/slope)
}
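The inverse function simply solves the fitted line for the input, so feeding it the model's own predictions should recover the original inputs (as long as the slope is not zero). This is an algebraic inversion of the fitted line, not a separate regression of sqft on price. The round-trip sketch below assumes the coefficients printed in step 3; the `sizes` vector is hypothetical.

```r
# Both functions are repeated so this snippet runs on its own.
get_regression_predictions = function(input_features, intercept, slope){
  return(intercept + slope*input_features)
}
inverse_regression_predictions = function(output, intercept, slope){
  return((output - intercept)/slope)
}

# Coefficients copied from the step 3 output (rounded as printed there).
intercept = -47116.08
slope = 281.9588

sizes = c(1000, 2650, 4000)   # hypothetical living areas in sqft
prices = get_regression_predictions(sizes, intercept, slope)
recovered = inverse_regression_predictions(prices, intercept, slope)

# TRUE if the recovered sizes match the originals up to floating-point tolerance.
round_trip_ok = isTRUE(all.equal(recovered, sizes))
round_trip_ok
```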
9. QUIZ QUESTION: According to this function and the regression slope and intercept from (3), what is the estimated sqft for a house costing $800,000?
inverse_regression_predictions(800000, sqft$intercept, sqft$slope)
## [1] 3004.396
10. Use the function from (2) to calculate the simple linear regression parameters 'slope' and 'intercept' for estimating price based on number of bedrooms. Save this slope and intercept for later.
bedroom = simple_linear_regression(train_data$bedrooms, train_data$price)
bedroom
## $slope
## [1] 127589
##
## $intercept
## [1] 109473.2
11. Compute RSS from both models using TEST data
rss_sqft = get_residual_sum_of_squares(test_data$sqft_living, test_data$price, sqft$intercept, sqft$slope)
rss_bedroom = get_residual_sum_of_squares(test_data$bedrooms, test_data$price, bedroom$intercept, bedroom$slope)
c(rss_sqft, rss_bedroom)
## [1] 2.754029e+14 4.933646e+14
12. Compare the RSS from both models. Which model has the smaller residual sum of squares? Why do you think this is the case?
The model that uses house size (square feet) has the smaller RSS on the test data (about 2.75e+14, versus 4.93e+14 for the bedrooms model). This is likely due to the fact that square footage is more predictive of house price than the number of bedrooms.
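To put the two test RSS values from step 11 on a common footing, one option is to take their ratio. The sketch below just copies the printed values; the variable name `rss_ratio` is introduced here for illustration.

```r
# Test-set RSS values copied from the step 11 output.
rss_sqft = 2.754029e+14
rss_bedroom = 4.933646e+14

# A ratio above 1 means the bedrooms model has the larger test error.
rss_ratio = rss_bedroom / rss_sqft
rss_ratio
```

Because both RSS values are computed on the same test houses, the ratio compares the two models directly: the bedrooms model's squared error is roughly 1.8 times that of the sqft model.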