Regression Week 2: Multiple Linear Regression Quiz 2

Estimating Multiple Regression Coefficients (Gradient Descent)

In this first notebook we explored multiple regression using lm. Now we will implement our own functions to estimate model parameters using gradient descent.

In this notebook we will cover estimating multiple linear regression weights via gradient descent. We will:

Add a constant column of 1s to the feature data
Converte these features from a data frame to a matrix
Write a predict_output() function
Write a function to compute the derivative of the regression weights with respect to a single feature
Write gradient descent function to compute the regression weights given an initial weigth vector, step size and tolerance
Use the gradient descent function to estimate regression weights for multiple features

1. Load train and test data

train_data = read.csv('kc_house_train_data.csv', header=T, sep=",")
test_data = read.csv('kc_house_test_data.csv', header=T, sep=",")

2. Write a function that take a data set, a list of features to be used as inputs, and a name of the output (e.g. price). This function should return a features_matrix consisting of first a column of ones followed by columns containing the values of the input features in the data set in the same order as the input list, it should also return an output_array which is an array of the values of the output i the data set (e.g. price)

get_matrix_data = function(dataframe, features, output){
    dataframe$constant = 1
    features_matrix = as.matrix(dataframe[,c('constant', features)])
    output = as.matrix(dataframe[,c(output)])
    return(list(features_matrix = features_matrix, output=output))
}

3. If the features matrix (including the column of 1s for the constant) is stored as a 2D array (matrix) and the regression weights are stored as a 1D array, then the predicted output is just the dot product between the features matrix and the weights. Write a function predict_output() which accepts a 2D array feature_matrix and a 1D array weights and returns a 1D array predictions

predict_output = function(feature_matrix, weights){
    return(feature_matrix %*% weights)
}

4. If we have the values of a single input feature in an array feature and the prediction errors (predictions - output) then the derivative of the regression cost function with respect to the weight of feature is just wice the dot product between feature and errors. Write a fuction that accepts a feature array and the error array and return sthe derivative (a single number).

feature_derivative = function(errors, features){
    return(2 * (t(features) %*% errors))
}

5. Now we will use predict_output() and feature_derivative() to write a gradient descent function. Although we can compute the derivative for all the features simultaneously (the gradient) we will explicitly loop over the features individually for simplicity. Write a gradient descent function that does the following:

Accepts a feature_matrix, a 1D output array, an array of initial weights, step_size, and a convergence tolerence.
While not converged updates each feature weight by subtracting the step size times the derivative for that feature given the current weights
At each step computes the magnitude/length of the gradient (square root of the sum of squared components)
When the magnitude of the gradient is smaller than the input tolerance, return the final weight vecotr

Note: instead of using the function feature_derivative() to compute the gradient for each feature I used the formula:

\(\triangledown RSS = -2H^{T}(y-Hw)\)

Where \(H^{T}\) is the transposed feature matrix, \(y\) is the vector of outputs \(H\) is the feature matrix and \(w\) is the vector of weights. This is gradient equation is derived by taking partial derivatives with respect to each weight, \(\triangledown RSS\) is a vector of gradients evaluate for a given set of \(w\)

regression_gradient_descent = function(feature_matrix, output, initial_weights, step_size, tolerance){
    converged = FALSE
    weights = initial_weights
    while(!converged){
        predictions = predict_output(feature_matrix, weights)
        error = predictions - output
        gradient = -2*t(feature_matrix) %*% error
        gradient_magnitude = sqrt(sum(gradient**2))
        weights = weights + step_size * gradient
        if(gradient_magnitude < tolerance){
            converged = TRUE
        }
    }
    return(weights)
}

6. No we will run the regression_gradient_descent function on some acutal data. In particular we will use the gradient descent to estimate the model from Week 1 using just an intercept and slope. Use the following parameters:

Features: sqft_living
Output: price
initial_weights = [1,-47000]
step_size = 7e-12
tolerance = 2.5e7

simple_features = c('sqft_living')
my_output = 'price'
out = get_matrix_data(train_data, simple_features, my_output)
initial_weights = c(-47000, 1)
step_size = 7e-12
tolerance = 2.5e7

# Estimate weights using gradient descent
simple_weights = regression_gradient_descent(out$features_matrix, out$output, initial_weights, step_size, tolerance)
simple_weights

##                    [,1]
## constant    -46999.8872
## sqft_living    281.9121

7. QUIZ QUESTION: What is the value of the weight for sqft_living the second element of simple_weights (rounded to 1 decimal place)

## [1] 281.9

8. Now build a corresponding test_simple_feature_matrix() and test_output() function using test_data. Using test_simple_feature_matrix() and simple_weights compute the predicted house prices on all the test data.

# Test the simple_weights parameters using test_data
out = get_matrix_data(test_data, c('sqft_living'), 'price')
predictions = predict_output(out$features_matrix, simple_weights)

9. QUIZ QUESTION: What is the predicted price for the 1st house in the TEST data set for model 1 (rounded to the nearest dollar)

## [1] 356134

10. Now compute the RSS on all test data for this model. REcord the value and store it for later.

simple_rss = sum((test_data$price - predictions)**2)
simple_rss

## [1] 2.754e+14

11. Now we will use the gradient descent to fit a model with more than 1 predictor variable (and an intercept). Use the following parameters:

features = ['sqft_living', 'sqft_living15']
output = 'price'
initial_weights = [-100000, 1, 1,] (intecept, sqft_living, sqft_living_15)
step_size = 4e-12
tolerance=1e9

data = get_matrix_data(train_data, c('sqft_living', 'sqft_living15'), 'price')
initial_weights = c(-100000, 1, 1)
step_size = 4e-12
tolerance = 1e9

# Run gradient descent using parameters above
weights = regression_gradient_descent(data$features_matrix, data$output, initial_weights, step_size, tolerance)
weights

##                       [,1]
## constant      -99999.96885
## sqft_living      245.07260
## sqft_living15     65.27953

12. Use the regression weights from this second model (using sqft_living and sqft_living_15) and predict the outcome of all the house prices on TEST data.

data = get_matrix_data(test_data, c('sqft_living', 'sqft_living15'), 'price')
predictions = predict_output(data$features_matrix, weights)

13. QUIZ QUESTION: What is the predicted price for the 1st house in the TEST data set for model 2 (round to the nearest dollar)

## [1] 366651

14. What is the actual price for the 1st house in the TEST data

## [1] 310000

15. QUIZ QUESTION: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2

16. Now compute RSS on all TEST data for the second model. Record the value and store it for later

second_rss = sum((test_data$price - predictions)**2)
second_rss

## [1] 2.702634e+14

17. QUIZ QUESTION: Which model (1 or 2) has lowest RSS on all the TEST DATA

Model 2 with RSS of 2.702634e14