Regression Week 1: Simple Linear Regression Assignment

Predicting House Prices (One Feature)

In this notebook we will use data on house sales in King County, where Seattle is located, to predict house prices using simple (one-feature) linear regression. We will:

1. Load data

train_data = read.csv("kc_house_train_data.csv", header=TRUE, sep=",")
test_data = read.csv("kc_house_test_data.csv", header=TRUE, sep=",")

2. Write a generic function that accepts a column of data (e.g., a vector) 'input_features' and another column 'output', and returns the Simple Linear Regression parameters 'intercept' and 'slope'. Use the closed-form solution from lecture to calculate the slope and intercept.

simple_linear_regression = function(input_features, output){
    # Sums needed by the closed-form solution
    sum_y = sum(output)
    sum_x = sum(input_features)
    sum_yx = sum(output*input_features)
    sum_xx = sum(input_features^2)
    n = length(output)
    # slope = (sum(xy) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
    slope = (sum_yx - ((sum_y*sum_x)/n))/(sum_xx - ((sum_x^2)/n))
    # intercept = mean(y) - slope*mean(x)
    intercept = (sum_y/n) - slope*(sum_x/n)
    return(list(slope=slope, intercept=intercept))
}
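
As a quick sanity check on the closed form, here is a minimal sketch on made-up toy data (not the King County data). It relies on the fact that the slope formula above is algebraically the same as cov(x, y) / var(x), and compares the result with R's built-in lm(). The names toy_x, toy_y, toy_slope, toy_intercept and toy_fit are illustrative only.

```r
# Toy data lying exactly on the line y = 1 + 2x (made up for illustration)
toy_x = c(1, 2, 3, 4, 5)
toy_y = 1 + 2 * toy_x

# Closed form written with sample covariance and variance
toy_slope = cov(toy_x, toy_y) / var(toy_x)
toy_intercept = mean(toy_y) - toy_slope * mean(toy_x)

# Reference fit from base R; coef() returns the intercept first, then the slope
toy_fit = coef(lm(toy_y ~ toy_x))

print(c(slope = toy_slope, intercept = toy_intercept))
print(toy_fit)
```

Both approaches should recover a slope of 2 and an intercept of 1 on this toy data, up to floating-point error.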

3. Use the function to calculate the estimated slope and intercept on the training data to predict price given sqft_living

sqft = simple_linear_regression(train_data$sqft_living, train_data$price)
sqft
## $slope
## [1] 281.9588
## 
## $intercept
## [1] -47116.08

4. Write a function that accepts a column of data input_features, the slope and the intercept you learned, and returns a column of predictions 'predicted_output' for each entry in the input column.

get_regression_predictions = function(input_features, intercept, slope){
    return(intercept + slope*input_features)
}

5. QUIZ QUESTION: Using the slope and intercept from (3), what is the predicted price for a house with 2,650 sqft?

get_regression_predictions(2650, sqft$intercept, sqft$slope)
## [1] 700074.8

6. Write a function that accepts two columns of data, input_features and output, and the regression parameters slope and intercept, and returns the Residual Sum of Squares (RSS).

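get_residual_sum_of_squares = function(input_features, output, intercept, slope){
    # Predictions from the fitted line
    y_hat = intercept + slope*input_features
    # Sum of squared differences between observed and predicted values
    return(sum((output - y_hat)^2))
}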
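
To see the RSS computation on numbers small enough to check by hand, here is a sketch on made-up toy data. It applies the same arithmetic as the function above and compares the result with deviance(), which for an lm fit returns the residual sum of squares. All toy_* and rss_* names are illustrative and are not part of the assignment.

```r
# Made-up toy data: points near, but not exactly on, a line
toy_x = c(1, 2, 3, 4)
toy_y = c(2, 4, 5, 8)

# Fit with lm() and pull out the coefficients (intercept first, then slope)
toy_model = lm(toy_y ~ toy_x)
toy_coef = unname(coef(toy_model))

# Same arithmetic as get_residual_sum_of_squares
toy_y_hat = toy_coef[1] + toy_coef[2] * toy_x
rss_manual = sum((toy_y - toy_y_hat)^2)

# Built-in equivalent for lm fits
rss_builtin = deviance(toy_model)

print(c(manual = rss_manual, builtin = rss_builtin))
```

The fitted line here is y = 1.9x, the residuals are 0.1, 0.2, -0.7 and 0.4, and both RSS values come out to 0.7 up to floating-point error.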

7. QUIZ QUESTION: According to this function and the slope and intercept from (3), what is the RSS for the simple linear regression using sqft to predict prices on the TRAINING data?

get_residual_sum_of_squares(train_data$sqft_living, train_data$price, sqft$intercept, sqft$slope)
## [1] 1.201918e+15

8. Write a function that accepts a column of data output and the regression parameters slope and intercept, and returns the column of data estimated_input.

inverse_regression_predictions = function(output, intercept, slope){
    return((output - intercept)/slope)
}
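
Since the inverse function simply solves the fitted line for the input, applying it to a prediction should return the original input. The sketch below checks this round trip using hypothetical parameters (toy_intercept and toy_slope are chosen for easy arithmetic and are not the fitted values).

```r
# Hypothetical parameters, chosen only for easy arithmetic
toy_intercept = 100
toy_slope = 50

# Made-up square footages
toy_sqft = c(500, 1000, 1500)

# Forward: same arithmetic as get_regression_predictions
toy_price = toy_intercept + toy_slope * toy_sqft

# Backward: same arithmetic as inverse_regression_predictions
toy_recovered = (toy_price - toy_intercept) / toy_slope

print(toy_price)
print(toy_recovered)
```

Note that this inverts the price-on-sqft line. It is not the same as fitting a separate regression of sqft on price, which would generally give a different line.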

9. QUIZ QUESTION: According to this function and the regression slope and intercept from (3) what is the estimated sqft for a house costing $800,000?

inverse_regression_predictions(800000, sqft$intercept, sqft$slope)
## [1] 3004.396

10. Use the function from (2) to calculate the Simple Linear Regression parameters slope and intercept for estimating price based on number of bedrooms. Save this slope and intercept for later.

bedroom = simple_linear_regression(train_data$bedrooms, train_data$price)
bedroom
## $slope
## [1] 127589
## 
## $intercept
## [1] 109473.2

11. Compute RSS from both models using TEST data

rss_sqft = get_residual_sum_of_squares(test_data$sqft_living, test_data$price, sqft$intercept, sqft$slope)
rss_bedroom = get_residual_sum_of_squares(test_data$bedrooms, test_data$price, bedroom$intercept, bedroom$slope)
c(rss_sqft, rss_bedroom)
## [1] 2.754029e+14 4.933646e+14

12. Compare the RSS from both models. Which model has the smaller residual sum of squares? Why do you think this is the case?

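The model that uses house size (square feet) has the smaller RSS on the test data: about 2.75e+14, versus about 4.93e+14 for the bedrooms model, which is roughly 1.8 times larger. This is likely due to the fact that square footage is more predictive of house price than the number of bedrooms.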