Regression Week 1: Simple Linear Regression

Predicting House Prices (One Feature)

In this notebook we will use data on house sales in King County, where Seattle is located, to predict house prices using simple (one feature) linear regression. We will:

  • Use SArray and SFrame functions to compute important summary statistics
  • Write a function to compute the Simple Linear Regression weights using the closed form solution
  • Write a function to make predictions of the output given the input features
  • Turn the regression around to predict the input/feature given the output
  • Compare two different models for predicting house prices
In [1]:
# 1. Import graphlab and load in the house data.
import graphlab
sales = graphlab.SFrame("kc_house_data.gl")
This non-commercial license of GraphLab Create is assigned to bernauer@salud.unm.edu and will expire on November 16, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] GraphLab Server Version: 1.7.1
In [2]:
# 2. Split data into 80% training and 20% test data.
train_data, test_data = sales.random_split(0.8, seed=0)
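GraphLab's `random_split` isn't available outside that stack; as a rough sketch of what a seeded 80/20 split does, here is a hypothetical pure-Python `random_split` over a plain list of rows (the name and signature mirror the SFrame method, but this is an illustration, not GraphLab's implementation):

```python
import random

def random_split(rows, fraction, seed=0):
    """Deterministically split rows into (train, test) lists by the given fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row independently lands in train with probability `fraction`,
        # so the split is approximately, not exactly, 80/20.
        (train if rng.random() < fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, 0.8, seed=0)
```

Fixing the seed makes the split reproducible, which is why the assignment passes `seed=0`.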
In [4]:
# 3. Write a generic function that accepts a column of data (e.g. an SArray)
# 'input_features' and another column 'output' and returns the Simple
# Linear Regression parameters 'intercept' and 'slope'. Use the closed
# form solution from lecture to calculate the slope and intercept.
def simple_linear_regression(input_features, output):
    sum_y = output.sum()
    sum_x = input_features.sum()
    sum_yx = (output*input_features).sum()
    sum_xx = (input_features**2).sum()
    n = float(len(output))
    slope = (sum_yx - ((sum_y*sum_x)/n))/(sum_xx - ((sum_x*sum_x)/n))
    intercept = (sum_y/n) - slope*(sum_x/n)
    return(intercept, slope)
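The same closed-form solution can be checked without SArrays: the sketch below re-implements the function over plain Python sequences and verifies it on synthetic data drawn exactly from the line y = 3 + 2x, so the recovered intercept and slope should be 3 and 2:

```python
def simple_linear_regression(input_features, output):
    """Closed-form simple linear regression over two equal-length sequences."""
    n = float(len(output))
    sum_x = sum(input_features)
    sum_y = sum(output)
    sum_yx = sum(y * x for x, y in zip(input_features, output))
    sum_xx = sum(x * x for x in input_features)
    # slope = (sum(xy) - sum(y)sum(x)/n) / (sum(x^2) - sum(x)^2/n)
    slope = (sum_yx - sum_y * sum_x / n) / (sum_xx - sum_x * sum_x / n)
    # intercept = mean(y) - slope * mean(x)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope

xs = [0, 1, 2, 3, 4]
ys = [3 + 2 * x for x in xs]   # exact line: intercept 3, slope 2
intercept, slope = simple_linear_regression(xs, ys)
```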
In [5]:
# 4. Use the function to calculate the estimated slope and intercept on
# the training data to predict 'price' given 'sqft_living'.
sqft_icept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])
print "Sqft intercept: ", sqft_icept
print "Sqft slope: ", sqft_slope
Sqft intercept:  -47116.0765749
Sqft slope:  281.958838568
In [6]:
# 5. Write a function that accepts a column of data 'input_features',
# the 'slope' and the 'intercept' you learned, and returns a column of
# predictions 'predicted_output' for each entry in the input column
def get_regression_predictions(input_features, intercept, slope):
    return intercept + slope*input_features
In [7]:
# 6. QUIZ QUESTION: Using the slope and intercept from (4), what is the
# predicted price for a house with 2650 sqft?
predicted_price = get_regression_predictions(2650, sqft_icept, sqft_slope)
predicted_price
Out[7]:
700074.8456294581
In [10]:
# 7. Write a function that accepts a column of data: 'input_features' and
# 'output' and the regression parameters 'slope' and 'intercept', and
# outputs the Residual Sum of Squares (RSS)
def get_residual_sum_of_squares(input_features, output, intercept, slope):
    y_hat = intercept + input_features*slope
    return ((output-y_hat)**2).sum()
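A quick sanity check on RSS, again with plain Python lists standing in for SArrays: a model that fits the data exactly has RSS 0, and perturbing one point by 5 should raise the RSS by exactly 5² = 25:

```python
def get_residual_sum_of_squares(input_features, output, intercept, slope):
    """Sum of squared residuals between observed output and the line's predictions."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(input_features, output))

xs = [0, 1, 2, 3]
ys = [3 + 2 * x for x in xs]
rss_exact = get_residual_sum_of_squares(xs, ys, 3, 2)   # perfect fit
ys[0] += 5                                              # one residual of 5
rss_off = get_residual_sum_of_squares(xs, ys, 3, 2)
```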
In [11]:
# 8. QUIZ QUESTION: According to this function and the slope and intercept
# from (4) what is the RSS for the simple linear regression using squarefeet
# to predict prices on TRAINING data?
rss = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_icept, sqft_slope)
print rss
1.20191835632e+15
In [12]:
# 9. Write a function that accepts a column of data 'output' and the
# regression parameters 'slope' and 'intercept' and outputs the column
# of data 'estimated_input'
def inverse_regression_predictions(output, intercept, slope):
    return (output - intercept)/float(slope)
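Because the inverse function just solves y = intercept + slope·x for x, inverting a prediction should recover the original input. The sketch below checks that round trip using illustrative parameter values close to the learned sqft model (the exact numbers are placeholders, not the fitted coefficients):

```python
def get_regression_predictions(x, intercept, slope):
    return intercept + slope * x

def inverse_regression_predictions(y, intercept, slope):
    # Solve y = intercept + slope * x for x (requires slope != 0)
    return (y - intercept) / float(slope)

# Round trip: predict a price, then invert to recover the square footage
price = get_regression_predictions(2650.0, -47116.0, 282.0)
sqft = inverse_regression_predictions(price, -47116.0, 282.0)
```

Note the inversion only makes sense when the slope is nonzero; a flat line maps every input to the same output and cannot be inverted.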
In [14]:
# 10. QUIZ QUESTION: According to this function and the regression slope
# and intercept from (4), what is the estimated square-feet for a house
# costing $800,000?
estimate_sqft = inverse_regression_predictions(800000, sqft_icept, sqft_slope)
print estimate_sqft
3004.39624762
In [15]:
# 11. Use the function from (3) to calculate the Simple Linear Regression
# slope and intercept for estimating price based on bedrooms. Save this
# slope and intercept for later.
bedroom_icept, bedroom_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
print "Bedroom intercept: ", bedroom_icept
print "Bedroom slope: ", bedroom_slope
Bedroom intercept:  109473.180469
Bedroom slope:  127588.952175
In [16]:
# 12. Compute RSS from both models on the TEST data
rss_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_icept, sqft_slope)
rss_bedroom = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedroom_icept, bedroom_slope)
print "RSS sqft: ", rss_sqft
print "RSS bedroom: ", rss_bedroom
RSS sqft:  2.75402936247e+14
RSS bedroom:  4.93364582868e+14
In [17]:
# 13. QUIZ QUESTION: Which model (sqft vs bedrooms) has the lowest RSS on
# TEST data? Think about why this might be the case.

# Answer: The sqft model has the lowest RSS. This is likely because square
# footage is more predictive of house price than the number of bedrooms.
In [ ]: