Neural Networks, Part I, Basic network from scratch
I create my first ever neural network. It is a vanilla network, written from scratch in Python.
Introduction
I finally take the plunge and create my first neural network. I have been holding back because I wanted to create my first neural network from scratch before using ready-made packages like TensorFlow or PyTorch, so that I develop a deeper understanding of what is going on (I should probably do the same for the other big algorithms I have already used, like Random Forests). I take the plunge now because I came across this tutorial by Michael Nielsen, which explains everything from scratch. (Funnily enough, I found the tutorial indirectly, via Michael's excellent article on Quantum Computation.)
The neural network I create is almost completely vanilla: no fancy architecture, the cost function is the residual sum of squares (RSS), and the activation function is the sigmoid. The one non-vanilla idea is to use mini-batches to estimate the gradient of the cost function, rather than the whole training dataset, for each update. After reading through Chapter 1 of Nielsen's tutorial and skimming the example code, I tried to write the program from scratch. I did check back with the example code on several occasions to make sure I was not going astray, so my code and his example are very similar.
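For concreteness, the update applied after each mini-batch (the learning_rate/mbs factor in the code below) is, writing $\eta$ for the learning rate and $m$ for the mini-batch size,

$$ w \;\to\; w - \frac{\eta}{m} \sum_{x \in \text{mini-batch}} \nabla_w C_x, \qquad b \;\to\; b - \frac{\eta}{m} \sum_{x \in \text{mini-batch}} \nabla_b C_x, $$

where $C_x$ is the squared error the network makes on the single training example $x$. Averaging over a small random mini-batch stands in for averaging over the full training set.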
Testing the code
To test that the code works, I created some simple made-up data to classify (see the code below) and got the following output:
Epoch 0 starting. Epoch 0 done. Accuracy is 0.518
Epoch 1 starting. Epoch 1 done. Accuracy is 0.582
Epoch 2 starting. Epoch 2 done. Accuracy is 0.893
Epoch 3 starting. Epoch 3 done. Accuracy is 0.919
Epoch 4 starting. Epoch 4 done. Accuracy is 0.956
Epoch 5 starting. Epoch 5 done. Accuracy is 0.973
Epoch 6 starting. Epoch 6 done. Accuracy is 0.966
Epoch 7 starting. Epoch 7 done. Accuracy is 0.966
Epoch 8 starting. Epoch 8 done. Accuracy is 0.967
Epoch 9 starting. Epoch 9 done. Accuracy is 0.971
Accuracy on testing data: 0.968
It was satisfying to see that the code appears to work!
A proud moment
A part of doing things from scratch was deriving the back-propagation formulae. I found this trickier than I was expecting. After all, I just have to apply the Chain Rule over and over again; how hard can that be? After straining my mind for some time, I think I have it, but I am not sure. Before trying to code it up, I have a look at Nielsen's code to check, and it turns out I got it right. I was chuffed with myself! :D
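For the record, these are the formulae the back-propagation loop in the code below implements, using the same row-vector convention as the code (the activation $a^l$ of layer $l$ is a $(1, n_l)$ array, and $w^l$ is the $(n_{l-1}, n_l)$ weight matrix feeding into layer $l$):

$$ \frac{\partial C}{\partial a^L} = 2\,(a^L - y), \qquad \delta^l := \frac{\partial C}{\partial z^l} = \sigma'(z^l) \odot \frac{\partial C}{\partial a^l}, $$

$$ \frac{\partial C}{\partial b^l} = \delta^l, \qquad \frac{\partial C}{\partial w^l} = (a^{l-1})^{\mathsf T}\, \delta^l, \qquad \frac{\partial C}{\partial a^{l-1}} = \delta^l\, (w^l)^{\mathsf T}, $$

where $L$ denotes the output layer, $\sigma$ is the sigmoid and $\odot$ is element-wise multiplication.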
Learning points
- The main mistake I made when coding up the algorithm was not paying attention to how a vector should be represented in NumPy. In particular, NumPy does not treat a rank-1 array of shape (n,) the same as a rank-2 array of shape (1, n): transposing a rank-1 array, for example, does nothing. This took some time to debug, because my first suspicion was that I had mis-typed the formulae, or mixed up the indices, or made some other little error. In the end, I changed how I coded the vectors, representing them as rank-2 arrays of shape (1, n).
- Nielsen often had a tidier way of coding the same steps or calculations, often by using zip. This is a useful little function which I will be sure to use in the future! (Both points are illustrated in the snippet below.)
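A minimal illustration of both learning points (this snippet is mine, not from Nielsen's tutorial):

import numpy as np

# Rank-1 arrays do not behave like row vectors: transposing a shape-(3,)
# array is a no-op, so outer products such as grad_w silently go wrong.
v1 = np.array([1.0, 2.0, 3.0])            # shape (3,)
print(v1.T.shape)                          # (3,) -- transpose changed nothing
v2 = np.array([[1.0, 2.0, 3.0]])           # shape (1, 3), an explicit row vector
print(v2.T.shape)                          # (3, 1)
print(np.dot(v2.T, v2).shape)              # (3, 3) outer product, as needed for grad_w

# zip pairs up consecutive layer sizes, which tidies the weight initialisation:
sizes = [2, 3, 4, 5]
print(list(zip(sizes[:-1], sizes[1:])))    # [(2, 3), (3, 4), (4, 5)]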
Next steps
The immediate next step is to use this code to classify handwritten digits from the MNIST dataset, and then to work through the rest of Nielsen's tutorial, where the network is optimised in various ways. After that, the world is my oyster! At some point I need to learn some reinforcement learning, so that I can continue with my AI for Games project.
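As a rough sketch of what feeding MNIST into this network might look like (this is not from the post; it assumes scikit-learn's fetch_openml is available, and load_mnist_for_network is just an illustrative helper name):

import numpy as np
from sklearn.datasets import fetch_openml

def load_mnist_for_network():
    # Download MNIST as 70,000 flattened 28x28 images plus string labels.
    X, labels = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0                           # scale pixel values to [0, 1]
    data = []
    for x_row, label in zip(X, labels):
        x = x_row.reshape(1, 784)           # (1, n) row vector, as Network expects
        y = np.zeros(10)
        y[int(label)] = 1.0                 # one-hot target for the argmax accuracy check
        data.append((x, y))
    return data[:60000], data[60000:]       # the conventional train/test split

# data_train, data_test = load_mnist_for_network()
# net = Network([784, 30, 10])
# net.sgd(data_train, 30, 10, 3.0)
# print(net.accuracy(data_test))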
The code
import numpy as np
import math
import random


class Network():

    def __init__(self, sizes):
        # sizes lists the number of neurons in each layer, e.g. [2, 3, 4, 5]
        # is a network with a 2-neuron input layer, two hidden layers and a
        # 5-neuron output layer.
        self.n_layers = len(sizes)
        self.sizes = sizes
        # Biases are (1, n) row vectors; weights connecting a layer of size1
        # neurons to one of size2 neurons form a (size1, size2) matrix.
        self.biases = [np.random.normal(size = (1,size)) for size in sizes[1:]]
        self.weights = [np.random.normal(size = (size1, size2))
                        for size1, size2 in zip(sizes[:-1], sizes[1:])]
    def feed_forward(self, a):
        # Push the (1, n) input row vector a through every layer.
        b = self.biases
        w = self.weights
        for i in range(self.n_layers - 1):
            a = vsigmoid(np.dot(a,w[i]) + b[i])
        return a
    def sgd(self, data_train, epochs, mini_batch_size, learning_rate):
        # Stochastic gradient descent: shuffle the training data each epoch,
        # split it into mini-batches and update the parameters once per batch.
        n_dt = len(data_train)
        mbs = mini_batch_size
        for epoch in range(epochs):
            print(f"Epoch {epoch} starting. ", end = "")
            random.shuffle(data_train)
            mini_batches = [data_train[k:k+mbs]
                            for k in range(0, n_dt, mbs)]
            for mini_batch in mini_batches:
                self.update_via_minibatch(mini_batch, learning_rate)
            acc = self.accuracy(data_train)
            print(f"Epoch {epoch} done. Accuracy is {acc:.3f}")
        return None
    def update_via_minibatch(self, mini_batch, learning_rate):
        # Sum the gradients over the mini-batch, then take one descent step
        # scaled by learning_rate / mini-batch size.
        mbs = len(mini_batch)
        delta_b = [np.zeros((1,size)) for size in self.sizes[1:]]
        delta_w = [np.zeros((size1, size2))
                   for size1, size2 in zip(self.sizes[:-1], self.sizes[1:])]
        for x,y in mini_batch:
            db, dw = self.backprop(x,y)
            delta_b = [b1 + b2 for b1,b2 in zip(delta_b, db)]
            delta_w = [w1 + w2 for w1,w2 in zip(delta_w, dw)]
        self.biases = [b - (learning_rate/mbs)*db
                       for b, db in zip(self.biases, delta_b)]
        self.weights = [w - (learning_rate/mbs)*dw
                        for w, dw in zip(self.weights, delta_w)]
        return None
    def backprop(self, x, y):
        # introduce shorthand notation for weights and biases
        w = self.weights
        b = self.biases
        # feed forward, storing the values of a and z for every layer
        a_temp = x
        z_temp = x
        a = [x]
        z = [x]
        for i in range(self.n_layers - 1):
            z_temp = np.dot(a_temp, w[i]) + b[i]
            a_temp = vsigmoid(z_temp)
            z.append(z_temp)
            a.append(a_temp)
        # define variables to store gradients
        grad_a = [None for _ in a]
        grad_z = [None for _ in z]
        grad_b = [None for _ in b]
        grad_w = [None for _ in w]
        n = self.n_layers
        # initialise gradients for a and z in the final layer
        # (derivative of the squared-error cost with respect to the output)
        grad_a[n-1] = 2*(a[n-1]-y)
        grad_z[n-1] = vsigmoid_prime(z[n-1])*grad_a[n-1]
        # back-propagate layer by layer
        for i in range(n-2,-1,-1):
            grad_b[i] = grad_z[i+1]
            grad_w[i] = np.dot(np.transpose(a[i]), grad_z[i+1])
            grad_a[i] = np.dot(grad_z[i+1], np.transpose(w[i]))
            grad_z[i] = vsigmoid_prime(z[i])*grad_a[i]
        return grad_b, grad_w
    def accuracy(self, data_test):
        # Fraction of examples where the largest output coincides with
        # the position of the 1 in the one-hot target y.
        acc = 0
        for x,y in data_test:
            y_hat = self.feed_forward(x)
            match = (np.argmax(y_hat) == np.argmax(y))
            acc += int(match)
        return acc / len(data_test)
def sigmoid(z):
    return 1/(1+math.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid
    return (math.exp(-z))/((1+math.exp(-z))**2)

# vectorised versions that apply element-wise to NumPy arrays
vsigmoid = np.vectorize(sigmoid)
vsigmoid_prime = np.vectorize(sigmoid_prime)
def test():
    # Two easily separable classes: points around (0.6, 0.6) belong to
    # class 0 and points around (0.2, 0.2) belong to class 4.
    net = Network([2,3,4,5])
    data_train = []
    for _ in range(1000):
        if random.randint(0,1) == 0:
            x = np.random.normal(loc = 0.6, scale = 0.15, size = (1,2))
            y = np.array([1,0,0,0,0])
        else:
            x = np.random.normal(loc = 0.2, scale = 0.15, size = (1,2))
            y = np.array([0,0,0,0,1])
        data_train.append((x,y))
    net.sgd(data_train, 10, 50, 3.0)
    data_test = []
    for _ in range(1000):
        if random.randint(0,1) == 0:
            x = np.random.normal(loc = 0.6, scale = 0.15, size = (1,2))
            y = np.array([1,0,0,0,0])
        else:
            x = np.random.normal(loc = 0.2, scale = 0.15, size = (1,2))
            y = np.array([0,0,0,0,1])
        data_test.append((x,y))
    print(f"Accuracy on testing data: {net.accuracy(data_test)}")


test()