Predicting House Prices (One Feature)
In this notebook we will use data on house sales in King County, where Seattle is located, to predict house prices using simple (one feature) linear regression. We will:
# 1. Import graphlab and load in the house data.
import graphlab
sales = graphlab.SFrame("kc_house_data.gl")
# 2. Split data into 80% training and 20% test data.
train_data, test_data = sales.random_split(0.8, seed=0)
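# Quick check (optional, not one of the numbered steps): confirm the split
# sizes look like an 80/20 split.
print "Training rows: ", len(train_data)
print "Test rows: ", len(test_data)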
# 3. Write a generic function that accepts a column of data (e.g. an SArray)
# 'input_features' and another column 'output' and returns the Simple
# Linear Regression parameters 'intercept' and 'slope'. Use the closed-form
# solution from lecture to calculate the slope and intercept:
#   slope = (sum(x*y) - sum(x)*sum(y)/N) / (sum(x^2) - sum(x)^2/N)
#   intercept = mean(y) - slope*mean(x)
def simple_linear_regression(input_features, output):
    # Accumulate the sums needed by the closed-form solution
    sum_y = output.sum()
    sum_x = input_features.sum()
    sum_yx = (output * input_features).sum()
    sum_xx = (input_features ** 2).sum()
    n = float(len(output))
    # Closed-form estimates of the slope and intercept
    slope = (sum_yx - (sum_y * sum_x) / n) / (sum_xx - (sum_x * sum_x) / n)
    intercept = (sum_y / n) - slope * (sum_x / n)
    return (intercept, slope)
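# Quick sanity check (optional, not one of the numbered steps): on synthetic data
# generated from a known line y = 1 + 1*x, the closed-form solution should
# recover an intercept of 1 and a slope of 1.
test_feature = graphlab.SArray(range(5))
test_output = 1 + 1 * test_feature
test_intercept, test_slope = simple_linear_regression(test_feature, test_output)
print "Test intercept: ", test_intercept  # expect 1.0
print "Test slope: ", test_slope          # expect 1.0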
# 4. Use the function to calculate the estimated slope and intercept on
# the training data to predict 'price' given 'sqft_living'.
sqft_icept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])
print "Sqft intercept: ", sqft_icept
print "Sqft slope: ", sqft_slope
# 5. Write a function that accepts a column of data 'input_features',
# the 'slope' and the 'intercept' you learned, and returns a column of
# predictions 'predicted_output' for each entry in the input column
def get_regression_predictions(input_features, intercept, slope):
    # y_hat = intercept + slope*x, applied element-wise to the input column
    return intercept + slope * input_features
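# Because the arithmetic is element-wise, the same function also works on a whole
# column at once. As an optional visual check (assuming matplotlib is available
# alongside graphlab), plot the training data against the fitted line.
import matplotlib.pyplot as plt
train_predictions = get_regression_predictions(train_data['sqft_living'], sqft_icept, sqft_slope)
plt.plot(train_data['sqft_living'], train_data['price'], '.',
         train_data['sqft_living'], train_predictions, '-')
plt.xlabel('sqft_living')
plt.ylabel('price')
plt.show()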
# 6. QUIZ QUESTION: Using the slope and intercept from (4), what is the
# predicted price for a house with 2650 sqft?
predicted_price = get_regression_predictions(2650, sqft_icept, sqft_slope)
print "Predicted price for 2650 sqft: ", predicted_price
# 7. Write a function that accepts a column of data 'input_features' and
# 'output' and the regression parameters 'slope' and 'intercept' and
# outputs the Residual Sum of Squares (RSS).
def get_residual_sum_of_squares(input_features, output, intercept, slope):
    # Predictions from the fitted line, then the sum of squared residuals
    y_hat = intercept + input_features * slope
    return ((output - y_hat) ** 2).sum()
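# Sanity check (optional): for the synthetic data above, the line with intercept 1
# and slope 1 fits perfectly, so the RSS should be exactly 0.
print get_residual_sum_of_squares(test_feature, test_output, 1, 1)  # expect 0.0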
# 8. QUIZ QUESTION: According to this function and the slope and intercept
# from (4), what is the RSS for the simple linear regression using square feet
# to predict prices on TRAINING data?
rss = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_icept, sqft_slope)
print rss
# 9. Write a function that accepts a column of data 'output' and the
# regression parameters 'slope' and 'intercept' and outputs the column
# of data 'estimated_input'
def inverse_regression_predictions(output, intercept, slope):
    # Invert y = intercept + slope*x to get x = (y - intercept) / slope
    return (output - intercept) / float(slope)
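# Optional round-trip check (not one of the numbered steps): inverting a prediction
# should recover the original input, up to floating-point error.
print inverse_regression_predictions(
    get_regression_predictions(2650, sqft_icept, sqft_slope),
    sqft_icept, sqft_slope)  # expect ~2650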
# 10. QUIZ QUESTION: According to this function and the regression slope
# and intercept from (4), what is the estimated square feet for a house
# costing $800,000?
estimate_sqft = inverse_regression_predictions(800000, sqft_icept, sqft_slope)
print estimate_sqft
# 11. Use the function from (3) to calculate the Simple Linear Regression
# slope and intercept for estimating price based on bedrooms. Save this
# slope and intercept for later.
bedroom_icept, bedroom_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
print "Bedroom intercept: ", bedroom_icept
print "Bedroom slope: ", bedroom_slope
# 12. Compute RSS from both models on the TEST data
rss_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_icept, sqft_slope)
rss_bedroom = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedroom_icept, bedroom_slope)
print "RSS sqft: ", rss_sqft
print "RSS bedroom: ", rss_bedroom
# 13. QUIZ QUESTION: Which model (sqft vs. bedrooms) has the lowest RSS on
# TEST data? Think about why this might be the case.
# Answer: the sqft model has the lowest RSS. This is likely because square
# footage is more predictive of house price than the number of bedrooms.
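# As a quick programmatic confirmation of the answer above (a small extra, not
# part of the assignment):
print "Model with lower test RSS: ", ('sqft_living' if rss_sqft < rss_bedroom else 'bedrooms')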