Linear regression with one variable

In this exercise we will be implementing linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new restaurant. You would like to use data to help make a decision about which city to expand to next.

The data we have available can be found in ex1data1.txt and includes the population of each city (1st column) and the profit of a food truck in that city (2nd column), where negative values for profit represent a loss.

We wish to predict profit from population; in other words, to regress profit onto city population.

Exploring the data

Before starting on any task, it is often useful to understand the data by visualizing it. In this case a scatter plot works well, since we only have two variables (population and profit).

library(ggplot2)
# Load the data; the file has no header row, so the columns are named V1 and V2
truck_data <- read.csv("./data/ex1data1.txt", header=FALSE)
# Scatter plot of profit (V2) against population (V1)
p <- ggplot(truck_data) + geom_point(aes(x=V1, y=V2)) +
     xlab("Population") + ylab("Profit") + labs(title="Profit vs Population")
p
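Beyond the plot, a quick numerical summary also helps to get a feel for the ranges involved; a minimal check using only base R:

# Dimensions, column types and five-number summaries of population (V1) and profit (V2)
str(truck_data)
summary(truck_data)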

Gradient descent

In this part we will fit the linear regression parameters θ to our dataset using gradient descent.

Update equations

The objective of gradient descent is to minimize the cost function

\[J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\]

where the hypothesis function \(h_{\theta}(x)\) is given by the linear model

\[h_{\theta}(x) = \theta^{T}x = \theta_{0} + \theta_{1}x_{1}\]

The parameters of the model are the \(\theta_{j}\) values. These are the values that we will adjust to minimize the cost function. Note that the vector form above uses the convention \(x_{0} = 1\), so that the intercept \(\theta_{0}\) is included in \(\theta^{T}x\); this is why we will later add a column of ones to the data matrix. The minimization will be done using the batch gradient descent algorithm, which performs the update

\[\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_{j}\]

With each step of gradient descent, the parameters \(\theta_{j}\) (updated simultaneously for all \(j\)) come closer to the optimal values that achieve the lowest cost \(J(\theta)\).
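This update rule comes directly from the gradient of the cost function: differentiating \(J(\theta)\) with respect to \(\theta_{j}\) gives

\[\frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_{j}\]

so each update moves \(\theta_{j}\) a step of size \(\alpha\) against the gradient, i.e. in the direction of steepest descent.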

Implementing the cost function

Now that we understand how the parameters are adjusted in order to minimize the cost function, we should write a function that helps us check whether our algorithm is converging on the minimum cost. This function computes the cost given the data matrix \(X\), the observed values \(y\), and the parameter vector \(\theta\).

# Function to compute the least-squares cost J(theta)
compute_cost <- function(X, y, theta){
  m <- nrow(X)          # number of training examples
  hx <- X %*% theta     # predicted values h_theta(x)
  J <- (1/(2*m)) * sum((hx - y)^2)
  return(J)
}

Now let's test the cost function we just wrote. If the cost function is implemented correctly, it should return approximately 32.07 for \(\theta = [0, 0]^{T}\).

# Add a column of ones to X for the intercept parameter theta0
y <- truck_data[,2]
X <- truck_data[,1]
m <- length(y)
X <- cbind(rep(1, m), X)

# Initialize theta to the zero vector
theta <- matrix(0, nrow=2, ncol=1)
compute_cost(X, y, theta)
## [1] 32.07273
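As a further sanity check, we can evaluate the cost at an arbitrary non-zero parameter vector, for example \(\theta = [-1, 2]^{T}\) (the values are just an illustration; any \(\theta\) can be plugged in the same way):

# Cost at an arbitrary non-zero theta
compute_cost(X, y, matrix(c(-1, 2), ncol=1))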

Implementing gradient descent

We can now implement the gradient descent algorithm. Below is a vectorized implementation, which computes the full gradient with matrix operations instead of looping over the training examples.

gradient_descent <- function(X, y, theta, iters=1e4, alpha=0.01){
  J <- rep(NA, iters)   # cost history, one entry per iteration
  m <- length(y)        # number of training examples
  for(i in 1:iters){
    hx <- X %*% theta                                   # current predictions
    theta <- theta - alpha*(1/m) * (t(X) %*% (hx - y))  # simultaneous update of all theta_j
    J[i] <- compute_cost(X, y, theta)
  }
  return(list(cost_hist=J, params=theta))
}

We can now use gradient descent to find the optimal parameters, θ, that minimize the sum of squared errors between the observed and predicted values.

# Run gradient descent
out <- gradient_descent(X, y, theta)

# Plot the cost history
cost_df <- data.frame(iter=1:length(out$cost_hist), cost=out$cost_hist)
p <- ggplot(cost_df) + geom_line(aes(x=iter, y=cost)) +
     xlab("Iteration") + ylab("Cost") + labs(title="Cost function")
p
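To confirm convergence numerically, we can also look at the last recorded cost, which should have leveled off near the minimum:

# Final cost after the last iteration
tail(out$cost_hist, 1)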

# Plot the regression line
p <- ggplot(truck_data) + geom_point(aes(x=V1, y=V2)) + geom_abline(intercept=out$params[1], slope=out$params[2], color="red") +
     xlab("Population") + ylab("Profit") + labs(title="Gradient descent: Regression line")
p
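With the fitted parameters we can also predict the profit for a new city. A minimal sketch, where the population values 3.5 and 7 are arbitrary examples on the same scale as column V1:

# Predict profit for two hypothetical populations: h(x) = [1, x] %*% theta
new_pop <- c(3.5, 7)
cbind(1, new_pop) %*% out$params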

Linear regression using ggplot2

This same procedure could have been carried out using only ggplot2, since geom_smooth can fit and draw the linear model for us.

p <- ggplot(truck_data) + geom_point(aes(x=V1, y=V2)) + geom_smooth(aes(x=V1, y=V2), method="lm", se=FALSE) +
     labs(title="ggplot2: Regression line") + ylab("Profit") + xlab("Population")
p 

Linear regression using R-base

We can also determine the linear regression coefficients using the lm function in R-base.

fit <- lm(V2 ~ V1, data=truck_data)
theta <- coef(fit)
p <- ggplot(truck_data) + geom_point(aes(x=V1, y=V2)) + geom_abline(intercept=theta[1], slope=theta[2], color="red") + 
     labs(title="R-base: Regression line") + ylab("Profit") + xlab("Population")
p
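As a cross-check, the same coefficients can also be obtained in closed form from the normal equations, \(\hat{\theta} = (X^{T}X)^{-1}X^{T}y\), using the design matrix \(X\) we built earlier:

# Closed-form least-squares solution; should agree with coef(fit) and out$params
theta_ne <- solve(t(X) %*% X, t(X) %*% y)
theta_ne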