Artificial neural networks can be used for both regression and classification tasks. Today we're going to be working with the MNIST dataset of handwritten digits.
Our first task is to load our data and get it into the correct format for our network. Load numpy for array multiplication.
%matplotlib inline
import numpy as np
Next we need to download our data and preprocess it for our neural network. Load fetch_openml from sklearn.datasets and use it to download the data.
from sklearn.datasets import fetch_openml
# download MNIST from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
# normalize the digits; maximum value is 255
X = X/255
# print dimensionality of our data
print(X.shape)
print(y.shape)
# K, the number of output classes
digits = 10
Getting our data into the right format for our neural network is not a trivial task. Since we're predicting 10 digits, our network will have 10 output nodes. We need to convert our $y \in \{0,\ldots,9\}$ categorical vector into a matrix where each row is a one of $K$ vector. In the machine learning literature, this is referred to as a one hot encoding of the data. Use the OneHotEncoder()
function in sklearn.preprocessing
to create an encoder with sparse = False
and transform our data.
from sklearn.preprocessing import OneHotEncoder
# convert to one hot array
encoder = OneHotEncoder(sparse = False, categories='auto')
y = encoder.fit_transform(y.reshape(-1, 1))
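As a quick, purely illustrative sanity check, we can look at the first encoded label; the 1 should sit in the column corresponding to the original digit.
# inspect the first one hot encoded label
print(y[0])
print('encoded digit:', np.argmax(y[0]))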
With that taken care of, we need to split our data into training and test sets, so use train_test_split() from sklearn.model_selection to hold out a 60,000 observation test set, leaving 10,000 observations for training.
from sklearn.model_selection import train_test_split
# hold out a 60,000 observation test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = int(6e4))
# transpose X objects for matrix multiplication later
X_train, X_test = X_train.T, X_test.T
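Before moving on, it's worth confirming the shapes: with the split above we expect 784 pixel rows and 10,000 training columns in X_train, with the remaining 60,000 observations in X_test.
# rows are pixels and columns are observations after the transpose
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)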
Let's make sure our data survived this process intact, so plot the first observation in X_train using matplotlib.pyplot with a binary color map from matplotlib.cm.
import matplotlib.pyplot as plt
import matplotlib.cm as cm
# plot to inspect data
plt.imshow(X_train[:, 0].reshape(28,28), cmap = cm.binary)
In a neural network each unit calculates the weighted sum of its inputs $$a_j = \sum_i w_{ji}z_i$$ which is then passed through a nonlinear activation function $h(\cdot)$ so that $$z_j = h(a_j)$$
We're going to build a two-layer network, so the full model is $$y_k(\mathbf{x},\mathbf{w}) = \sigma\left( \sum_{j=1}^M w_{kj} h\left( \sum_{i=1}^D w_{ji} x_i + w_{j0} \right) + w_{k0} \right)$$ where $\sigma(\cdot)$ is the softmax function, $h(\cdot)$ is the sigmoid function, and $k \in \{1,\ldots,10\}$ indexes the ten possible digits. Load the expit()
function from scipy.special
to get access to the sigmoid function.
from scipy.special import expit as sigmoid
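expit() is just the logistic sigmoid $h(x) = 1/(1+e^{-x})$; a quick call on a few made-up values shows it squashing everything into $(0, 1)$.
# the sigmoid maps any real input into (0, 1)
print(sigmoid(np.array([-5., 0., 5.])))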
The output function in our network is the softmax function, which takes a vector of real values and converts it to a vector of probabilities that sums to one
$$f(x_k) = \frac{e^{x_k}}{\sum_{j=1}^{K} e^{x_j}}$$
Unfortunately, numpy
doesn't have a built-in version, so define one below.
def softmax(x):
    # exponentiate and normalize so the outputs sum to one
    return(np.exp(x)/np.exp(x).sum())
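As a quick check on a made-up vector, the outputs should be positive, sum to one, and give the largest input the most probability mass.
# softmax of a toy vector; the probabilities sum to 1
p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())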
We also need a way to measure the accuracy of our network at each epoch. The most common loss function for multiclass classification is the categorical cross entropy, which for one hot targets $t_{nk}$ and network outputs $y_k(\mathbf{x}_n, \mathbf{w})$ is
$$E(\mathbf{w}) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w})$$
Fill in the code below for the function loss()
to calculate categorical cross entropy. Note that np.multiply() computes the element-wise product inside the sums, and np.sum() then collapses both of them.
def loss(true, predicted):
    # average cross entropy over the observations
    L_sum = np.sum(np.multiply(true, np.log(predicted)))
    L = -1 * (L_sum / true.shape[0])
    return(L)
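To see the loss behaving sensibly, compare a confident correct prediction with a confident wrong one on a single made-up observation; the wrong one should be penalized much more heavily.
# toy example: the true class is the second of three
t = np.array([[0., 1., 0.]])
print(loss(t, np.array([[0.05, 0.90, 0.05]])))  # mostly correct, small loss
print(loss(t, np.array([[0.90, 0.05, 0.05]])))  # confidently wrong, large loss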
To calculate the derivatives of each of the weights, and thereby evaluate the gradient of the error, we need to calculate the errors at each layer of the network.
The error at the output is given by $$ \delta_k = y_k - t_k $$
while the error at the hidden layer is given by
$$ \delta_j = h'(a_j) \sum_k w_{kj} \delta_k $$
This lets us find the derivative of the error with respect to the input weights
$$ \frac{\partial E_n}{\partial w_{ji}} = \delta_j x_i $$
and with respect to the output weights
$$ \frac{\partial E_n}{\partial w_{kj}} = \delta_k z_j $$
keeping in mind that because we're doing batch gradient descent, each gradient matrix needs to be divided by the number of observations used to compute it.
Finally, we update all of our weights with
$$ \mathbf{w}^{\tau + 1} = \mathbf{w}^{\tau} -\eta \frac{\partial E_n}{\partial w} $$
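Before wiring these formulas into a full training loop, it can be reassuring to check them numerically. The sketch below is purely illustrative: the toy dimensions and names (W1s, b1s, toy_loss, and so on) are made up and appear nowhere else in this notebook. It compares the analytic derivative of a single first-layer weight, computed with the equations above, against a centered finite-difference estimate; the two numbers should agree to several decimal places.
# toy dimensions: inputs, hidden units, outputs, observations
D, M, K, N = 5, 4, 3, 6
Xs = np.random.randn(D, N)                    # toy inputs, observations in columns
ys = np.eye(K)[np.random.randint(0, K, N)]    # toy one hot targets, shape (N, K)
W1s, b1s = np.random.randn(M, D), np.zeros((M, 1))
W2s, b2s = np.random.randn(K, M), np.zeros((K, 1))

def toy_loss(W1_candidate):
    # forward pass and cross entropy as a function of the first-layer weights
    Z1 = sigmoid(np.matmul(W1_candidate, Xs) + b1s)
    Z2 = np.apply_along_axis(softmax, 0, np.matmul(W2s, Z1) + b2s)
    return(loss(ys, Z2.T))

# analytic gradient of the loss with respect to W1, using the equations above
A1 = np.matmul(W1s, Xs) + b1s
Z1 = sigmoid(A1)
Z2 = np.apply_along_axis(softmax, 0, np.matmul(W2s, Z1) + b2s)
dA2 = Z2 - ys.T
dA1 = np.matmul(W2s.T, dA2) * sigmoid(A1) * (1 - sigmoid(A1))
dW1 = np.matmul(dA1, Xs.T) / N

# centered finite-difference estimate for the (0, 0) entry of W1
eps = 1e-6
W1_plus, W1_minus = W1s.copy(), W1s.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (toy_loss(W1_plus) - toy_loss(W1_minus)) / (2 * eps)
print(dW1[0, 0], numeric)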
Now we have all the tools we need to create a neural network. Fill in the code below to complete our network function.
def network(X, y, epochs, eta):
    # initialize weights and biases
    W1 = np.random.randn(64, X.shape[0])
b1 = np.zeros((64, 1))
W2 = np.random.randn(digits, 64)
b2 = np.zeros((digits, 1))
# iterate through epochs
for e in range(epochs):
# calculate feed forward predictions
A1 = np.matmul(W1, X) + b1
Z1 = sigmoid(A1)
A2 = np.matmul(W2, Z1) + b2
Z2 = np.apply_along_axis(softmax, 0, A2)
# calculate derivatives for second weight layer
dA2 = Z2 - y.T
dW2 = np.matmul(dA2, Z1.T) / X.shape[1]
db2 = np.sum(dA2, axis=1, keepdims=True) / X.shape[1]
# calculate derivatives for first weight layer
dZ1 = np.matmul(W2.T, dA2)
dA1 = dZ1 * sigmoid(A1) * (1 - sigmoid(A1))
dW1 = np.matmul(dA1, X.T) / X.shape[1]
db1 = np.sum(dA1, axis=1, keepdims=True) / X.shape[1]
# gradient update to weights
W2 -= eta * dW2
b2 -= eta * db2
W1 -= eta * dW1
b1 -= eta * db1
# print loss every 10th iteration
if (e % 10 == 0): print('Epoch', e, 'loss:', loss(y, Z2.T))
# return weights in dictionary for prediction
return({'W1':W1, 'b1':b1, 'W2':W2, 'b2':b2})
Now it's time to train our neural network. Run the network for 100 epochs with $\eta=.1$ and watch the loss decrease! Be sure to assign the result to an object so we can hold onto those weights.
nn = network(X_train, y_train, epochs=100, eta=.1)
We can see that the loss printed every 10th epoch keeps decreasing, but we still don't know how well our network actually predicts the data.
To answer this question, we need to be able to make predictions from our network, i.e. $y(\mathbf{x}, \mathbf{w}_{\text{network}})$ where $\mathbf{w}_{\text{network}}$ are the weights we obtained by training our network. Define a function feed_forward()
to automate the process of forward propagation below so we can easily make predictions from our network.
def feed_forward(X, weights):
A1 = np.matmul(weights['W1'], X) + weights['b1']
Z1 = sigmoid(A1)
A2 = np.matmul(weights['W2'], Z1) + weights['b2']
Z2 = np.apply_along_axis(softmax, 0, A2)
return(Z2)
Now use feed_forward()
to calculate $y(\mathbf{x}, \mathbf{w}_{\text{network}})$ and convert our one of $K$ matrix y_test
back to a categorical vector for comparison to our predictions.
predictions = np.argmax(feed_forward(X_test, nn), axis = 0)
labels = np.argmax(y_test, axis=1)
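A single overall accuracy number (the fraction of test digits we labelled correctly) is a handy summary before digging into the per-digit breakdown.
# overall accuracy on the test set
print((predictions == labels).mean())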
Once we've done this, we can use confusion_matrix()
and classification_report()
from sklearn.metrics
to compute a confusion matrix and produce a per-digit classification report, passing the true labels first and our predictions second.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(labels, predictions))
print(classification_report(labels, predictions))
We're not doing fantastically here, and our loss decreased pretty slowly. One of the easiest ways to speed up any gradient descent algorithm is to use stochastic gradient descent, which cycles through the observations in random order and updates the weights based on one observation at a time. We're going to implement a compromise between stochastic and batch gradient descent, usually called mini-batch gradient descent, where in each epoch we shuffle the data, split it into $b$ batches, and then evaluate the network and update the weights one batch at a time.
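The shuffle-and-slice pattern is easier to see on a tiny made-up example before it gets buried inside the training loop; the sizes below are arbitrary.
# shuffle 8 toy observation indices and walk through them in batches of 4
toy_n, toy_batches = 8, 2
toy_batch_size = toy_n // toy_batches
toy_perm = np.random.permutation(toy_n)
for i in range(toy_batches):
    print(toy_perm[i * toy_batch_size:(i + 1) * toy_batch_size])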
Complete the code below to define the function network_sgd()
, which implements mini-batch stochastic gradient descent. Note that in the interest of clarity, this code is not robust: if the number of observations is not evenly divisible by the number of batches, the leftover observations are silently dropped from each epoch.
def network_sgd(X, y, epochs, batches, eta):
# calculate batch size
batch_size = int(X.shape[1] / batches)
    # initialize weights and biases
W1 = np.random.randn(64, X.shape[0])
b1 = np.zeros((64, 1))
W2 = np.random.randn(digits, 64)
b2 = np.zeros((digits, 1))
# iterate through epochs
for e in range(epochs):
# shuffle observations
permutation = np.random.permutation(X.shape[1])
X_shuffled = X[:, permutation]
y_shuffled = y[permutation, :]
# iterate through batches
for i in range(batches):
# subset batch i
begin = i * batch_size
            end = min(begin + batch_size, X.shape[1])
X_batch = X_shuffled[:, begin:end]
y_batch = y_shuffled[begin:end, :]
# calculate feed forward predictions
A1 = np.matmul(W1, X_batch) + b1
Z1 = sigmoid(A1)
A2 = np.matmul(W2, Z1) + b2
Z2 = np.apply_along_axis(softmax, 0, A2)
# calculate derivatives for second weight layer
dA2 = Z2 - y_batch.T
dW2 = np.matmul(dA2, Z1.T) / X_batch.shape[1]
db2 = np.sum(dA2, axis=1, keepdims=True) / X_batch.shape[1]
# calculate derivatives for first weight layer
dZ1 = np.matmul(W2.T, dA2)
dA1 = dZ1 * sigmoid(A1) * (1 - sigmoid(A1))
dW1 = np.matmul(dA1, X_batch.T) / X_batch.shape[1]
db1 = np.sum(dA1, axis=1, keepdims=True) / X_batch.shape[1]
# gradient update to weights
W2 -= eta * dW2
b2 -= eta * db2
W1 -= eta * dW1
b1 -= eta * db1
# print loss for entire dataset every 10th epoch
if (e % 10 == 0):
y_pred = feed_forward(X, {'W1':W1, 'b1':b1,
'W2':W2, 'b2':b2}).T
print('Epoch', e, 'loss:', loss(y, y_pred))
# return weights in dictionary
return({'W1':W1, 'b1':b1, 'W2':W2, 'b2':b2})
Now let's train our network with mini-batch stochastic gradient descent and check out the improvements.
nn_sgd = network_sgd(X_train, y_train, epochs=100, batches = 10, eta=.1)
# calculate predictions
predictions_sgd = np.argmax(feed_forward(X_test, nn_sgd), axis = 0)
print(confusion_matrix(labels, predictions_sgd))
print(classification_report(labels, predictions_sgd))
Each epoch takes a little longer than in batch gradient descent. NumPy's matrix multiplication is heavily optimized, so it's more efficient to multiply a single matrix of 10,000 observations by our first layer of weights than to multiply a 1,000-observation batch by the same weight matrix 10 times in a Python loop.
However, we end up with a much smaller loss after 100 epochs than we did with batch gradient descent, and our classification report looks much better. Fill in the code below to calculate the increase in correctly predicted digits.
np.diag(confusion_matrix(predictions_sgd, labels)).sum() - np.diag(confusion_matrix(predictions, labels)).sum()
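The same comparison as an accuracy rate (just the diagonal counts divided by the size of the test set):
# test set accuracy with mini-batch versus full-batch training
print((predictions_sgd == labels).mean(), (predictions == labels).mean())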
The flexibility of neural networks can also be a curse. Using only $1/7$ of the total data, we're able to achieve $> 50\%$ precision on all digits with just 100 training epochs, but because neural networks can approximate any continuous function, they can also very easily overfit noise in our training data and do poorly on out-of-sample prediction. The network we've been using so far is pretty good at avoiding this behavior, but let's break a few things (a much larger learning rate and many more training epochs) to illustrate the problem.
Let's also calculate the validation loss every 10th epoch in addition to our training loss, and store both along with the epoch number. The training loss generally keeps falling as we train, but the validation loss often begins to rise again after an initial decrease.
def network_val(X_train, X_test, y_train, y_test, epochs, batches, eta):
# create empty lists to hold training and validation loss
train_loss = list()
test_loss = list()
    # calculate batch size from data
batch_size = int(X_train.shape[1] / batches)
# initialize weights and biases
W1 = np.random.randn(64, X_train.shape[0])
b1 = np.zeros((64, 1))
W2 = np.random.randn(digits, 64)
b2 = np.zeros((digits, 1))
# iterate through epochs
for e in range(epochs):
# shuffle observations
permutation = np.random.permutation(X_train.shape[1])
X_shuffled = X_train[:, permutation]
y_shuffled = y_train[permutation, :]
# iterate through batches
for i in range(batches):
# subset batch i
begin = i * batch_size
            end = min(begin + batch_size, X_shuffled.shape[1])
X_batch = X_shuffled[:, begin:end]
y_batch = y_shuffled[begin:end, :]
# calculate feed forward predictions
A1 = np.matmul(W1, X_batch) + b1
Z1 = sigmoid(A1)
A2 = np.matmul(W2, Z1) + b2
Z2 = np.apply_along_axis(softmax, 0, A2)
# calculate derivatives for second weight layer
dA2 = Z2 - y_batch.T
dW2 = np.matmul(dA2, Z1.T) / X_batch.shape[1]
db2 = np.sum(dA2, axis=1, keepdims=True) / X_batch.shape[1]
# calculate derivatives for first weight layer
dZ1 = np.matmul(W2.T, dA2)
dA1 = dZ1 * sigmoid(A1) * (1 - sigmoid(A1))
dW1 = np.matmul(dA1, X_batch.T) / X_batch.shape[1]
db1 = np.sum(dA1, axis=1, keepdims=True) / X_batch.shape[1]
# gradient update to weights
W2 -= eta * dW2
b2 -= eta * db2
W1 -= eta * dW1
b1 -= eta * db1
# record training and validation error every 10th epoch
if (e % 10 == 0):
# calculate predictions on training and validation data
y_pred_train = feed_forward(X_train, {'W1':W1, 'b1':b1,
'W2':W2, 'b2':b2})
y_pred_test = feed_forward(X_test, {'W1':W1, 'b1':b1,
'W2':W2, 'b2':b2})
# append training and validation loss to list
train_loss.append(loss(y_train, y_pred_train.T))
test_loss.append(loss(y_test, y_pred_test.T))
# return an array of epoch, training loss, and validation loss
    return(np.column_stack([np.arange(0, epochs, 10),
                            np.array(train_loss),
                            np.array(test_loss)]))
To really illustrate this behavior, significantly increase the learning rate (remembering that $\eta \in (0,1]$), and train this network for 500 epochs.
tv = network_val(X_train, X_test, y_train, y_test, epochs=500, batches = 10, eta=.99)
Finally, let's plot the training and validation loss as a function of epoch.
# plot training error
plt.plot(tv[:,0], tv[:,1])
# plot validation error
plt.plot(tv[:,0], tv[:,2])
# add legend
plt.legend(['training error', 'validation error'], loc='upper right')
# display plot
plt.show()
We can use this property of the loss to stop training our network when the validation loss begins to increase. If we do so, we can avoid overfitting our network to the training data.
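A minimal sketch of that rule using the tv array we already have: just read off which recorded epoch had the lowest validation loss. A live implementation would instead stop training once the validation loss has failed to improve for some number of checks.
# recorded epoch with the lowest validation loss
best = np.argmin(tv[:, 2])
print('best epoch:', int(tv[best, 0]), 'validation loss:', tv[best, 2])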