Multi-class classification and neural networks

In this exercise, we will implement one-vs-all logistic regression and neural networks to recognize hand-written digits.

Multi-class classification

We will use logistic regression and neural networks to recognize handwritten digits (from 0 to 9). Automated handwritten digit recognition is widely used today - from recognizing zip codes (postal codes) on mail envelopes to recognizing amounts written on bank checks. This exercise will walk you through these sorts of classification tasks.

Dataset

The dataset ex3data1_features.txt contains 5000 training examples of handwritten digits. Each training example is a 20x20 pixel grayscale image of the digit. Each pixel is represented by a floating-point number indicating the grayscale intensity at that location. The 20x20 grid of pixels is “unrolled” into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix \(X\). This gives us a 5000x400 matrix \(X\) where every row is a training example for a handwritten digit image.

The second part of the training set is a 5000x1 vector \(y\) containing the class label for each training example. These labels are stored in ex3data1_labs.txt.
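As a quick sanity check (a minimal sketch, assuming X and y have been loaded as in the next code chunk), we can confirm these dimensions and reshape a single row back into its 20x20 pixel grid:

# Confirm the dimensions of the feature matrix and label vector
dim(X)   # expected: 5000  400
dim(y)   # expected: 5000    1

# Reshape the first training example back into its 20x20 pixel grid
first_digit <- matrix(X[1,], nrow=20)
dim(first_digit)   # expected: 20 20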

Visualizing the data

Below we select 100 random images from the 5000 image training set and display them together as a single 10x10 grid of digits.

# Read in the data
X <- read.csv("ex3data1_features.txt", header=FALSE)
X <- as.matrix(X)
y <- read.csv("ex3data1_labs.txt", header=FALSE)
y <- as.matrix(y)

# Sample 100 random images
idx <- sample(5000, 100)
img <- X[idx,]

# Create blank matrix to hold all 100 images
canvas <- matrix(1, nrow=10*(20+1), ncol=10*(20+1))
count <- 1
for(i in seq(1+1, 10*(20+1), 20+1)){
  for(j in seq(1+1, 10*(20+1), 20+1)){
    canvas[i:(i+20-1), j:(j+20-1)] <- matrix(img[count,], nrow=20)
    count <- count + 1
  }
}
image(canvas, col=gray(seq(1,0,length=100)))

Vectorizing logistic regression with regularization

We will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, we will need to train 10 separate logistic regression classifiers. To make this training efficient, it is important to ensure that the code is well vectorized. In this section, we will implement a vectorized version of logistic regression that does not employ any for loops.

Recall that the regularized cost function is \[J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)}\log(h_{\theta}(x^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}\]

and the hypothesis is built from the sigmoid function \[h_{\theta}(x) = g(\theta^{T}x), \qquad g(z) = \frac{1}{1 + e^{-z}}\]

Implementation

# Sigmoid function
sigmoid_function <- function(z){
  1 / (1 + exp(-z))
}

# Vectorized cost function with regularization (theta_0 is not regularized)
cost_function <- function(X, y, theta, lambda=1){
  m <- dim(X)[1]
  hx <- sigmoid_function(X %*% theta)
  (1/m) * sum(-y * log(hx) - (1 - y) * log(1 - hx)) + (lambda/(2*m)) * sum(theta[-1]^2)
}
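As a quick sanity check of the cost function (a minimal sketch, assuming X and y are the matrices loaded above; X_int and theta_zero are illustrative names), note that with \(\theta\) initialized to all zeros the hypothesis is 0.5 for every example, so the cost should be \(-\log(0.5) \approx 0.693\) for any set of binary labels:

# With theta = 0, h(x) = 0.5 everywhere, so the cost is -log(0.5) ~ 0.693
m <- dim(X)[1]
X_int <- cbind(rep(1, m), X)                           # prepend the intercept column
theta_zero <- matrix(0, nrow=dim(X_int)[2], ncol=1)    # all parameters set to zero
cost_function(X_int, (y == 1), theta_zero, lambda=1)   # binary labels for class 1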

Vectorizing regularized gradient descent

Recall that the partial derivatives of the regularized logistic regression cost function with respect to \(\theta_{j}\) are defined as \[\frac{\partial J(\theta)}{\partial \theta_{0}} = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{0}^{(i)} \qquad \text{for } j = 0\]

\[\frac{\partial J(\theta)}{\partial \theta_{j}} = \Big(\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}\Big) + \frac{\lambda}{m}\theta_{j} \qquad \text{for } j \geq 1\]

Implementation

# Gradient descent
gradient_descent <- function(X, y, theta, lambda=1, iters=1e4, alpha=0.0001){
  J <- rep(NA, iters)
  m <- dim(X)[1]
  for(i in 1:iters){
    # Update function for theta 0
    hx <- sigmoid_function(X %*% theta)
    theta[1,1] <- theta[1,1] - alpha * ((1/m) * t(t(hx - y) %*% X[,1]))
  
    # Update function for theta 1...j
    theta[-1,1] <- theta[-1,1] - alpha * ((1/m) * t(t(hx - y) %*% X[,-1]) + (lambda/m)*theta[-1,1])
    J[i] <- cost_function(X, y, theta, lambda=lambda)
  }
  return(list(cost_hist=J, params=theta))
}

These vectorized expressions allow us to compute all of the partial derivatives without looping over training examples or individual parameters.
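For reference, the full regularized gradient can also be written as a single matrix expression. Below is a minimal sketch (the function name gradient is just for illustration; the training code above folds this computation directly into gradient_descent), with the regularization term zeroed out for the intercept parameter:

# Vectorized gradient: (1/m) * t(X) %*% (h - y), regularizing theta_1..theta_n only
gradient <- function(X, y, theta, lambda=1){
  m <- dim(X)[1]
  hx <- sigmoid_function(X %*% theta)
  (1/m) * (t(X) %*% (hx - y)) + (lambda/m) * rbind(0, theta[-1, , drop=FALSE])
}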

One-vs-all classification

In this part of the exercise we will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the \(K\) classes in our dataset. In the handwritten digits dataset, \(K = 10\), but the code should work for any value of \(K\).

The code below trains one classifier for each class and collects all of the classifier parameters into a matrix \(\Theta\) of dimension (n+1)xK, where each column of \(\Theta\) contains the learned logistic regression parameters for one class. This can be done using a “for”-loop from 1 to \(K\) (or, as below, a parallel apply over the class labels), training each classifier independently.

When training the classifier for class \(k \in \{1,...,K\}\), we need an \(m\)-dimensional vector of binary labels, where the \(j\)-th entry is 1 if the \(j\)-th training example belongs to class \(k\) and 0 if it belongs to a different class.

Below is the code that trains a separate classifier for each class label \(k\). In the code below we train the model on the entire example dataset; in practice, the data should be split into a training set (roughly 70% of the observations) and a test set (the remaining 30%).

# Library for utilizing multiple cores
library(BiocParallel)

# Function to train the 10 classifiers, one-vs-all
one_vs_all <- function(X, y, k, iters=1e3, lambda=1, alpha=0.0001){
    m <- dim(X)[1]
    X <- cbind(rep(1, m), X)
    n <- dim(X)[2]
    theta <- matrix(0, nrow=n, ncol=1)
    train_labs <- (y == k)
    theta_params <- gradient_descent(X, train_labs, theta, iters=iters, lambda=lambda, alpha=alpha)
    return(theta_params)
}

# Time how long it takes to train the model
system.time(theta_parameters <- bplapply(1:10, function(k) one_vs_all(X, y, k, iters=1e5, lambda=1, alpha=0.0001)))

# Extract the theta parameters from the model
theta_params <- lapply(theta_parameters, function(x) x$params)

# Extract the cost function progress from the model
cost_hist <- lapply(theta_parameters, function(x) x$cost_hist)

# Save the cost function history and theta parameters to csv
cost_matrix <- as.matrix(as.data.frame(cost_hist))
theta_matrix <- as.matrix(as.data.frame(theta_params))
write.table(cost_matrix, file="cost_hist.csv", row.names=FALSE, col.names=FALSE, sep=",")
write.table(theta_matrix, file="theta.csv", row.names=FALSE, col.names=FALSE, sep=",")

One-vs-all prediction

After training the one-vs-all classifier, we can now use it to predict the digit contained in a given image. For each input, we compute the “probability” that it belongs to each class using the trained logistic regression classifiers. The one-vs-all prediction function (implemented below) picks the class for which the corresponding logistic regression classifier outputs the highest probability and returns that class label (1, 2, …, or K) as the prediction for the input example.

# Function for predicting classes of unknown examples
prediction_function <- function(train_set, theta){
  hx <- sigmoid_function(train_set %*% theta)
  # Pick the class whose classifier outputs the highest probability
  labels <- apply(hx, 1, which.max)
  return(labels)
}

Now that we’ve implemented the prediction function, let’s try it out.

# Load the feature data
X <- as.matrix(read.csv("ex3data1_features.txt", header=FALSE))
m <- dim(X)[1]
X <- cbind(rep(1, m), X)
y <- read.csv("ex3data1_labs.txt", header=FALSE)
theta <- as.matrix(read.csv("theta.csv", header=FALSE, sep=","))

train_id <- sample(5000, 1000)
train_labs <- y[train_id, 1]
train_examples <- X[train_id,]
predictions <- prediction_function(train_examples, theta)
accuracy <- mean((train_labs == predictions))
accuracy * 100
## [1] 83.8

Accuracy

The overall accuracy of the model is determined by dividing the total number of correct classifications by the total number of examples in the test set.

\[Accuracy = \frac{n_{correct}}{N_{test}}\]
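In R this is just the mean of an element-wise comparison, and a confusion matrix gives a more detailed view of which digits are confused with one another (a minimal sketch, reusing train_labs and predictions from the chunk above):

# Overall accuracy: fraction of examples classified correctly
mean(train_labs == predictions)

# Confusion matrix: rows are true labels, columns are predicted labels
table(actual = train_labs, predicted = predictions)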

From the results above we can see that the accuracy of our model is 83.8%, which is not bad considering our gradient descent algorithm did not fully converge. We can check convergence by plotting the cost history returned by gradient descent.

require(reshape2)
## Loading required package: reshape2
require(ggplot2)
## Loading required package: ggplot2
cost_hist <- read.csv("cost_hist.csv", header=FALSE)
cost_hist <- cbind(1:dim(cost_hist)[1], cost_hist)
names(cost_hist) <- c("V0", names(cost_hist)[-1])
cost_hist_long <- melt(cost_hist, id.vars = "V0")

# Plot the cost function history for each of the classifiers
ggplot(cost_hist_long, aes(x=V0, y=value, color=variable)) + geom_line() +
  labs(title="Model Cost Function Convergence") + ylab("Cost") + xlab("Iteration")

The above plot shows the convergence of each model as the parameters are adjusted by gradient descent to minimize the cost function \(J(\theta)\). We can see that if the algorithm were allowed to run longer, it would eventually converge to a minimum; although it appears we are close to the minimum, we are not quite there. As a result, our model is not as accurate as it could be, with an overall classification accuracy of 83.8%.
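One way to quantify how far from convergence we are (a minimal sketch, reusing the cost_hist data frame loaded above and assuming the training run used more than 1000 iterations) is to look at how much each classifier’s cost is still decreasing over the final iterations; a relative decrease well above zero suggests that running gradient descent longer, or with a larger learning rate, would improve the fit:

# Relative decrease in cost over the last 1000 iterations for each classifier
n_iter <- dim(cost_hist)[1]
final_cost <- cost_hist[n_iter, -1]            # final cost (drop the iteration column V0)
earlier_cost <- cost_hist[n_iter - 1000, -1]   # cost 1000 iterations earlier
round((earlier_cost - final_cost) / earlier_cost, 4)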