Regularized logistic regression

In this exercise, we will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each chip goes through various tests to ensure it is functioning correctly.

Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

Visualizing the data

Before implementing any learning algorithm, you should always visualize the data.

library(ggplot2)

# Load the data: V1 = test 1 score, V2 = test 2 score, V3 = QA outcome (accepted/rejected)
chip_data <- read.csv("./data/ex2data2.txt", header=FALSE)

ggplot(chip_data) + geom_point(aes(x=V1, y=V2, colour=factor(V3))) + labs(title="QA: Microchip") +
  xlab("Microchip Test 1") + ylab("Microchip Test 2")

The above plot shows that our dataset cannot be separated into positive and negative examples by a straight line. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression is only able to find a linear decision boundary.

Feature mapping

One way to fit the data better is to create more features from each data point. To achieve this, we will map the features into all polynomial terms of \(x_{1}\) and \(x_{2}\) up to the sixth power. As a result of this mapping, our vector of two features (the scores on the two QA tests) will be transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary, which will appear non-linear when drawn in our 2-dimensional plot.
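
Concretely, for an example \(x = (x_{1}, x_{2})\) the mapping produces the feature vector

\[\big(1,\; x_{1},\; x_{2},\; x_{1}^{2},\; x_{1}x_{2},\; x_{2}^{2},\; x_{1}^{3},\; \ldots,\; x_{1}x_{2}^{5},\; x_{2}^{6}\big)\]

with one term for every pair of non-negative exponents summing to at most six, plus the intercept term: \(1 + 2 + 3 + \cdots + 7 = 28\) features in total.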

While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next part of this exercise we will implement regularized logistic regression to fit the data and also to see how regularization can help combat the overfitting problem.

Implementation

# Map two input features to all polynomial terms of X1 and X2 up to the given degree.
# The first column of the returned matrix is the intercept term (all ones), so for
# degree = 6 the output has 1 + 2 + 3 + ... + 7 = 28 columns.
map_features <- function(X, degree=6){
  X1 <- X[,1]
  X2 <- X[,2]
  m <- dim(X)[1]
  out <- matrix(1, ncol=1, nrow=m)        # intercept column
  for(i in 1:degree){
    for(j in 0:i){
      out <- cbind(out, X1^(i-j) * X2^j)  # add the term X1^(i-j) * X2^j
    }
  }
  return(out)
}
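
As a quick sanity check (not part of the original exercise; the test point is arbitrary), mapping a single example should produce a matrix with one row and 28 columns, the first of which is the intercept term:

# Map one arbitrary point; dim() should report 1 row and 28 columns
dim(map_features(matrix(c(0.5, -0.5), nrow=1)))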

Cost function and gradient

Recall that the regularized cost function for logistic regression is

\[J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log(h_{\theta}(x^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}\]

Note that we do not regularize the parameter \(\theta_{0}\).

The gradient of the cost function is a vector whose \(j^{th}\) element is defined as follows:

\[\frac{\partial J(\theta)}{\partial \theta_{0}} = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{0}^{(i)} \qquad \text{for } j = 0\]

\[\frac{\partial J(\theta)}{\partial \theta_{j}} = \Big(\frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}\Big) + \frac{\lambda}{m} \theta_{j} \qquad \text{for } j \geq 1\]

Implementation

# Sigmoid function
sigmoid_function <- function(z){
  1 / (1 + exp(-z))
}

# Regularized cost function. Note that theta[1] (i.e. theta_0, the intercept)
# is excluded from the regularization term.
cost_function <- function(X, y, theta, lambda=1){
  m <- dim(X)[1]
  hx <- sigmoid_function(X %*% theta)
  J <- (1/m) * sum(-y * log(hx) - (1-y) * log(1-hx)) + (lambda/(2*m)) * t(theta[-1]) %*% theta[-1]
  return(J)
}
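
For completeness, the regularized gradient from the previous section can also be written as a stand-alone vectorized helper. This is only a sketch for inspecting intermediate values (the name grad_function is ours); the gradient descent implementation below computes the same quantity inline.

# Regularized gradient of the cost function (vectorized).
# theta[1], i.e. theta_0, is excluded from the regularization term.
grad_function <- function(X, y, theta, lambda=1){
  m <- dim(X)[1]
  hx <- sigmoid_function(X %*% theta)
  grad <- (1/m) * (t(X) %*% (hx - y))
  grad[-1] <- grad[-1] + (lambda/m) * theta[-1]
  return(grad)
}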

Now we can call cost_function using \(\theta\) values that have been initialized to all zeros. If the cost function is implemented correctly, we should see a cost of about 0.693 on the ex2data2.txt data.

m <- dim(chip_data)[1]

# Design matrix of raw test scores plus an intercept column (no feature mapping yet)
X <- chip_data[, -3]
X <- cbind(rep(1, m), X)
X <- as.matrix(X)
n <- dim(X)[2]

# Labels and parameters initialized to zero
y <- chip_data[,3]
theta <- matrix(0, nrow=n, ncol=1)

# Call the cost function, should return 0.693
cost_function(X, y, theta)
##           [,1]
## [1,] 0.6931472

Gradient descent

We now need an algorithm that allows us to minimize the cost function \(J(\theta)\) with respect to the parameters \(\theta\). As in previous exercises, we will implement gradient descent. However, because we are using regularized logistic regression, the gradient descent update is constructed slightly differently from the non-regularized version.

Recall that the gradient for non-regularized logistic regression is

\[\frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}\]

for all parameters, including \(\theta_{0}\). In regularized logistic regression, the gradient is instead

\[\frac{\partial J(\theta)}{\partial \theta_{j}} = \Big(\frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}\Big) + \frac{\lambda}{m} \theta_{j}\]

for all parameters except \(\theta_{0}\), which should be updated without the regularization term \(\frac{\lambda}{m}\theta_{j}\).
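
Plugging these gradients into the usual gradient descent update, each iteration simultaneously performs

\[\theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{0}^{(i)}\]

\[\theta_{j} := \theta_{j} - \alpha\Big[\Big(\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}\Big) + \frac{\lambda}{m}\theta_{j}\Big] \qquad \text{for } j \geq 1\]

where \(\alpha\) is the learning rate.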

Implementation

gradient_descent <- function(X, y, theta, lambda=1, iters=1e4, alpha=0.0001){
  J <- rep(NA, iters)
  m <- dim(X)[1]
  for(i in 1:iters){
    hx <- sigmoid_function(X %*% theta)

    # Update for theta_0 (no regularization term)
    theta[1,1] <- theta[1,1] - alpha * ((1/m) * t(t(hx - y) %*% X[,1]))

    # Update for theta_1 ... theta_n (with regularization term)
    theta[-1,1] <- theta[-1,1] - alpha * ((1/m) * t(t(hx - y) %*% X[,-1]) + (lambda/m)*theta[-1,1])

    # Record the cost at each iteration so convergence can be inspected later
    J[i] <- cost_function(X, y, theta, lambda=lambda)
  }
  return(list(cost_hist=J, params=theta))
}

Putting it all together

To recap, we explored the data visually and determined that a linear decision boundary would not adequately separate the positive and negative examples. We therefore wrote a feature mapping function that takes our input matrix of test results and returns a higher-order feature space with polynomial terms up to the 6th degree. We then implemented a regularized cost function for logistic regression, along with the familiar sigmoid function, and a regularized gradient descent algorithm. We are now ready to bring it all together and learn the parameters using the mapped feature matrix.

# map_features already includes the intercept column, so we do not add another column of ones
high_order_features <- map_features(chip_data[, -3], degree=6)
high_order_features <- as.matrix(high_order_features)
m <- dim(high_order_features)[1]
n <- dim(high_order_features)[2]
y <- chip_data[,3]
theta <- matrix(0, ncol=1, nrow=n)

iters <- 1e6
fit <- gradient_descent(high_order_features, y, theta, lambda=1, alpha=0.0001, iters=iters)
qplot(1:iters, fit$cost_hist, geom="line", main="Cost function", ylab="Cost", xlab="Iterations")
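Because the plot only gives a rough picture of convergence, it can help to look at the last few recorded cost values directly; if they are still decreasing noticeably, more iterations (or a larger learning rate) may be needed. This is a small extra check, not part of the original exercise:

# The final cost values should be essentially flat if gradient descent has converged
tail(fit$cost_hist, 3)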

Plotting the decision boundary

# Evaluate the fitted model on a 100 x 100 grid spanning the range of the two test scores
min_data <- apply(chip_data[,-3], 2, min)
max_data <- apply(chip_data[,-3], 2, max)
decision_boundary <- data.frame(x=NA, y=NA, prediction=NA)
count <- 1
for(i in seq(from=min_data[1], to=max_data[1], length.out=100)){
  for(j in seq(from=min_data[2], to=max_data[2], length.out=100)){
    # map_features already includes the intercept column
    mapped_out <- map_features(matrix(c(i, j), nrow=1), degree=6)
    prediction <- round(sigmoid_function(mapped_out %*% fit$params))
    decision_boundary[count,] <- c(i, j, prediction)
    count <- count + 1
  }
}
ggplot(decision_boundary) + geom_point(aes(x=x, y=y, colour=factor(prediction)), alpha=0.15) + 
  geom_point(data=chip_data, aes(x=V1, y=V2, colour=factor(V3))) + labs(title="Regularized logistic regression: decision boundary") +
  xlab("Microchip Test 1") + ylab("Microchip Test 2")