Estimating Multiple Regression Coefficients (Gradient Descent)
In this first notebook we explored multiple regression using lm
.
Now we will implement our own functions to estimate model parameters
using gradient descent.
In this notebook we will cover estimating multiple linear regression weights via gradient descent. We will:
predict_output()
function1. Load train and test data
train_data = read.csv('kc_house_train_data.csv', header=T, sep=",")
test_data = read.csv('kc_house_test_data.csv', header=T, sep=",")
2. Write a function that take a data set, a list of features to be used as inputs,
and a name of the output (e.g. price
). This function should return a features_matrix
consisting of first a column of ones followed by columns containing the values of the
input features in the data set in the same order as the input list, it should also
return an output_array
which is an array of the values of the output i the data set
(e.g. price
)
get_matrix_data = function(dataframe, features, output){
dataframe$constant = 1
features_matrix = as.matrix(dataframe[,c('constant', features)])
output = as.matrix(dataframe[,c(output)])
return(list(features_matrix = features_matrix, output=output))
}
3. If the features matrix (including the column of 1s for the constant) is stored
as a 2D array (matrix) and the regression weights are stored as a 1D array, then the
predicted output is just the dot product between the features matrix and the weights.
Write a function predict_output()
which accepts a 2D array feature_matrix
and a
1D array weights
and returns a 1D array predictions
predict_output = function(feature_matrix, weights){
return(feature_matrix %*% weights)
}
4. If we have the values of a single input feature in an array feature
and the prediction
errors
(predictions - output) then the derivative of the regression cost function with
respect to the weight of feature
is just wice the dot product between feature
and errors
.
Write a fuction that accepts a feature
array and the error
array and return sthe derivative
(a single number).
feature_derivative = function(errors, features){
return(2 * (t(features) %*% errors))
}
5. Now we will use predict_output()
and feature_derivative()
to write a gradient descent
function. Although we can compute the derivative for all the features simultaneously (the
gradient) we will explicitly loop over the features individually for simplicity. Write a
gradient descent function that does the following:
feature_matrix
, a 1D output array, an array of initial weights, step_size
,
and a convergence tolerence.Note: instead of using the function feature_derivative()
to compute the gradient for each feature I
used the formula:
\(\triangledown RSS = -2H^{T}(y-Hw)\)
Where \(H^{T}\) is the transposed feature matrix, \(y\) is the vector of outputs \(H\) is the feature matrix and \(w\) is the vector of weights. This is gradient equation is derived by taking partial derivatives with respect to each weight, \(\triangledown RSS\) is a vector of gradients evaluate for a given set of \(w\)
regression_gradient_descent = function(feature_matrix, output, initial_weights, step_size, tolerance){
converged = FALSE
weights = initial_weights
while(!converged){
predictions = predict_output(feature_matrix, weights)
error = predictions - output
gradient = -2*t(feature_matrix) %*% error
gradient_magnitude = sqrt(sum(gradient**2))
weights = weights + step_size * gradient
if(gradient_magnitude < tolerance){
converged = TRUE
}
}
return(weights)
}
6. No we will run the regression_gradient_descent
function on some acutal data. In
particular we will use the gradient descent to estimate the model from Week 1 using just
an intercept and slope. Use the following parameters:
sqft_living
price
initial_weights = [1,-47000]
step_size = 7e-12
tolerance = 2.5e7
simple_features = c('sqft_living')
my_output = 'price'
out = get_matrix_data(train_data, simple_features, my_output)
initial_weights = c(-47000, 1)
step_size = 7e-12
tolerance = 2.5e7
# Estimate weights using gradient descent
simple_weights = regression_gradient_descent(out$features_matrix, out$output, initial_weights, step_size, tolerance)
simple_weights
## [,1]
## constant -46999.8872
## sqft_living 281.9121
7. QUIZ QUESTION: What is the value of the weight for sqft_living
the second element of simple_weights
(rounded to 1 decimal place)
## [1] 281.9
8. Now build a corresponding test_simple_feature_matrix()
and test_output()
function using test_data
. Using
test_simple_feature_matrix()
and simple_weights
compute the predicted house prices on all the test data.
# Test the simple_weights parameters using test_data
out = get_matrix_data(test_data, c('sqft_living'), 'price')
predictions = predict_output(out$features_matrix, simple_weights)
9. QUIZ QUESTION: What is the predicted price for the 1st house in the TEST data set for model 1 (rounded to the nearest dollar)
## [1] 356134
10. Now compute the RSS on all test data for this model. REcord the value and store it for later.
simple_rss = sum((test_data$price - predictions)**2)
simple_rss
## [1] 2.754e+14
11. Now we will use the gradient descent to fit a model with more than 1 predictor variable (and an intercept). Use the following parameters:
features = ['sqft_living', 'sqft_living15']
output = 'price'
initial_weights = [-100000, 1, 1,]
(intecept, sqft_living, sqft_living_15)step_size = 4e-12
tolerance=1e9
data = get_matrix_data(train_data, c('sqft_living', 'sqft_living15'), 'price')
initial_weights = c(-100000, 1, 1)
step_size = 4e-12
tolerance = 1e9
# Run gradient descent using parameters above
weights = regression_gradient_descent(data$features_matrix, data$output, initial_weights, step_size, tolerance)
weights
## [,1]
## constant -99999.96885
## sqft_living 245.07260
## sqft_living15 65.27953
12. Use the regression weights from this second model (using sqft_living
and sqft_living_15
) and predict
the outcome of all the house prices on TEST data.
data = get_matrix_data(test_data, c('sqft_living', 'sqft_living15'), 'price')
predictions = predict_output(data$features_matrix, weights)
13. QUIZ QUESTION: What is the predicted price for the 1st house in the TEST data set for model 2 (round to the nearest dollar)
## [1] 366651
14. What is the actual price for the 1st house in the TEST data
## [1] 310000
15. QUIZ QUESTION: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2
16. Now compute RSS on all TEST data for the second model. Record the value and store it for later
second_rss = sum((test_data$price - predictions)**2)
second_rss
## [1] 2.702634e+14
17. QUIZ QUESTION: Which model (1 or 2) has lowest RSS on all the TEST DATA
Model 2 with RSS of 2.702634e14