#collapse-hide
import numpy as np
This notebook demonstrates implementing a neural network from scratch in Python, without any libraries except NumPy. The choice of Python, rather than a fast low-level language like C, is made on the premise that this serves as learning material. The implementation is presented sequentially, following the train of thought when building a neural network.
Introducing the Neural Network
Before going through this notebook, if you are a beginner, it can be useful to storm through this video by 3Blue1Brown, which will offer some visual intuition before you attack the text and code below:
youtube: https://youtu.be/aircAruvnKk
The word neural in “neural network” comes from neurobiology, which loosely inspired the neural network architecture, and the word ‘network’ is used because it encompasses multiple connected functions, resembling a network. In the process of explaining and constructing the neural network, we will be solving a supervised task. A supervised task is a task where there is a dataset of paired inputs and outputs:
\[\left\{\left(x_{i}, y_{i}\right)\right\}_{i=1}^{N}\]
With the use of neural networks, these paired inputs and outputs are assumed to have a functional dependence which we don’t know and which we try to approximate.
For example, later in this blog, we will try to teach/train a neural network to recognize whether an image (input) represents a cat or not (output). We therefore assume that there is a functional dependence between the pixels in the image and the output classes ‘cat’ and ‘not-cat’.
What our neural network will then help us do is find that function or a close approximation of it. Therefore we come to an important conclusion:
Important: Neural networks are function approximation machines.
A neural network is said to be able to approximate any function. More precisely, no matter what the function f is, there is guaranteed to be a neural network such that for every possible input x the neural network outputs f(x) or some close approximation.
Tip: This theorem is called “The Universal Approximation Theorem”
The real challenge is finding such a network and specializing it for our task.
A neural network can approximate functions because it is itself a chain of functions with adjustable parameters.
\[y=f_{NN}(x)=f_{L}\left(\ldots f_{2}\left(f_{1}(x)\right)\right)\]
Individual functions in this equation are called layers, each representing a vector-to-vector operation. To better understand the neural network building blocks, we can think of layers as consisting of multiple simple vector-to-scalar computational units called neurons. Each neuron is connected to the outputs of the preceding layer; it calculates a weighted sum of those inputs and applies an activation function to produce its output:
\[y=\varphi\left(\sum_{i} \omega_{i} x_{i}+b\right)\]
Where:
- \(x_i\) is a real number called input
- \(\omega_i\) is a real number called weight
- \(\varphi\) is a function applied to the weighted sum called an activation function
- \(b\) is a real number called bias
- \(y\) is a real number called the output of a neuron
\(\omega\) and \(b\) are called the parameters of the neural network.
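As a concrete illustration, a single neuron can be computed directly with NumPy. This is a minimal sketch with arbitrary numbers, taking \(\varphi\) to be ReLU (an activation function introduced later in this post):
# Sketch: one neuron with three inputs (arbitrary illustrative values)
x = np.array([0.5, -1.0, 2.0])    # inputs
w = np.array([0.1, 0.4, -0.2])    # weights
b = 0.05                          # bias

weighted_sum = np.dot(w, x) + b   # 0.05 - 0.4 - 0.4 + 0.05 = -0.7
y = np.maximum(0, weighted_sum)   # ReLU maps the negative sum to 0
print(y)                          # 0.0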
Mathematical calculations of a whole layer of neurons can be represented with the following formula:
\[f_{l}(z)=\varphi_{l}\left(\mathbf{W}_{l} \mathbf{y}_{l-1}+\mathbf{b}_{l}\right)\]
Where:
- \(\mathbf{W}_l\) is a matrix of weights for the current layer
- \(\mathbf{y}_{l-1}\) is a vector of inputs to this layer
- \(\mathbf{b}_l\) is a vector of biases
By connecting multiple layers, we get the feedforward neural network. An example of a neural network is presented below. The first layer is called the input layer and the last one is called the output layer. Intermediary layers are called hidden layers.
Building a neural network includes selecting its parameters as well as hyperparameters.
Selecting hyperparameters means deciding its architecture in terms of the number of layers, the number of neurons in each layer, the activation functions, etc. These determine the capacity of a neural network to approximate a function and are usually selected manually.
Parameters of a neural network, on the other hand, are represented by the weight matrices \(\mathbf{W}_l\) and bias vectors \(\mathbf{b}_l\) of each layer. Parameters control what a neural network does and are exactly what the process referred to as neural network training tries to optimize.
Important: The ability of a neural network to represent a variety of functions depends on its hyperparameters, while its parameters control how exactly the approximation is done.
For each neural network a cost function needs to be defined. A cost function takes in the output of the neural network and evaluates its performance on the dataset. After the evaluation of a neural network, its parameters can be adjusted to produce a better output using an algorithm called backpropagation.
Constructing a Neural Network
Parameter Initialization
First, we have to decide the number of layers and their dimensions. We will make this adjustable in code via a function accepting an array of integers, each specifying the number of neurons in the layer associated with its index. Then we need to initialize the parameters.
The parameters of a neural network need to be initialized before training. It turns out that initializing them all to zero is not the right way, because the backpropagation algorithm can’t work if all parameters have the same values. It is said that the parameters need to ‘break symmetry’. That is why here we will initialize weights to random small values, while biases can be left at zero. Other problems may also arise when initialization is not carefully handled, but the details of parameter initialization techniques are outside the scope of this post.
Parameters are used in later steps, so we store them in a dictionary:
def initialize_neural_network(layer_dims):
    parameters = {}
    layer_count = len(layer_dims)

    for layer in range(1, layer_count):
        parameters['w' + str(layer)] = np.random.randn(layer_dims[layer], \
                                                       layer_dims[layer - 1]) / np.sqrt(layer_dims[layer - 1])
        parameters['b' + str(layer)] = np.zeros((layer_dims[layer], 1))

    return parameters
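As a quick sanity check, we can inspect the shapes this function produces. This is a minimal sketch using an arbitrary small architecture of three layer sizes:
# Sketch: parameter shapes for an arbitrary small architecture [4, 3, 1]
demo_parameters = initialize_neural_network([4, 3, 1])
for name, value in demo_parameters.items():
    print(name, value.shape)
# Expected output:
# w1 (3, 4)
# b1 (3, 1)
# w2 (1, 3)
# b2 (1, 1)
Each weight matrix has one row per neuron in the current layer and one column per neuron in the previous layer, which is exactly what the layer formula \(\mathbf{W}_{l} \mathbf{y}_{l-1}+\mathbf{b}_{l}\) requires.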
Forward Propagation
Forward propagation is synonymous with applying the neural network to a set of inputs to produce an output. This means we need to apply the layer functions. Each layer function is composed of two parts. The first part is called the linear forward step (\(z^{[l]}\)), where each neuron calculates a weighted sum of all the inputs governed by our parameters. The second part is called the activation forward step (\(a^{[l]}\)), where a selected activation function (\(g^{[l]}\)) is applied to the output of the linear step and the result is sent to the next layer or the output. We can visualize these steps for a neural network with a computation graph. An example computation graph for a neural network with three layers (\(L=3\)) is shown below.
Where:
- \(a^{[0]}\) is the input to the network, synonymous with \(X\)
- \(z^{[l]}\) is the output of a linear forward step for layer \(l\)
- \(a^{[l]}\) is the output of an activation forward step for layer \(l\)
- \(a^{[L]}\) is the output of the network
Linear forward prop step is implemented as:
def linear_forward_prop_step(a_prev, w, b):
    z = np.dot(w, a_prev) + b
    return z
Now we need to implement the activation function. In as few words as possible, an activation function is used so the neural network can represent non-linear functions. The default activation function used in hidden layers of neural networks is the ReLU function. ReLU is a simple function which maps a value to itself if it’s positive, and to zero otherwise. ReLU is a good default choice because it has a stable gradient, which is useful in the backpropagation algorithm. Since in this blog we are doing binary classification (is the image an image of a cat?), the default output layer activation function is the sigmoid function, which maps values to the 0 to 1 range. The output will therefore correspond to how confident the network is that the image represents a cat. These activation functions are shown below.
The two mentioned activation functions are implemented as:
def sigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a

def relu(z):
    a = np.maximum(0, z)
    return a
We will let the user choose the activation by passing in its name in the activation forward prop step:
def activation_forward_prop_step(z, activation_name):
    if(activation_name == 'sigmoid'):
        a = sigmoid(z)
        return a
    elif(activation_name == 'relu'):
        a = relu(z)
        return a
    else:
        raise ValueError('Activation not supported')
Stepping through a single layer is then defined as:
def forward_prop_step(a_prev, w, b, activation_name):
    z = linear_forward_prop_step(a_prev, w, b)
    a = activation_forward_prop_step(z, activation_name)
    return z, a
Notice how we implemented everything modularly so we can easily generalize our calculations to different architectures as we will do right now.
Below is the implementation of the whole forward propagation, to which we pass the input, the parameters from the neural network initialization function, and an array of activation names to use for each layer in turn. For backward propagation, described later, we need to store all linear and activation outputs of the layers in a cache. The input to the neural network can also be thought of as an activation, so it is stored as such.
def forward_prop(X, parameters, activation_names):
    layer_count = len(parameters) // 2
    a = X

    forward_prop_cache = {}
    forward_prop_cache['a0'] = X

    for layer in range(1, layer_count + 1):
        a_prev = a
        z, a = forward_prop_step(a_prev,
                                 parameters['w' + str(layer)],
                                 parameters['b' + str(layer)],
                                 activation_name=activation_names[layer - 1])

        forward_prop_cache['z' + str(layer)] = z
        forward_prop_cache['a' + str(layer)] = a

    return forward_prop_cache
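To make the cached shapes concrete, here is a minimal sketch that pushes a few random samples through a small, arbitrary two-layer network (the layer sizes and activations are illustrative choices, not the ones used later in this post):
# Sketch: a forward pass through an arbitrary small network with 5 random samples
demo_X = np.random.randn(4, 5)                        # 4 features, 5 samples stored as columns
demo_parameters = initialize_neural_network([4, 3, 1])
demo_cache = forward_prop(demo_X, demo_parameters, ['relu', 'sigmoid'])

for key, value in demo_cache.items():
    print(key, value.shape)
# Expected output:
# a0 (4, 5)
# z1 (3, 5)
# a1 (3, 5)
# z2 (1, 5)
# a2 (1, 5)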
Cost calculation
A neural network produces some output for any given input. If a neural network were run right after parameter initialization, the output would be gibberish because the network is not trained yet, but that’s okay! To train our network with the backpropagation algorithm, we need a way to measure its performance. A cost function is used to measure performance by comparing the output/predictions of the neural network to the desired outputs shown in the dataset for a given sample. The cost function described here is the cross-entropy cost function.
\[J=-\frac{1}{m} \sum_{i=1}^{m}\left(y^{(i)} \log \left(a^{[L](i)}\right)+\left(1-y^{(i)}\right) \log \left(1-a^{[L](i)}\right)\right)\]
Cost calculation can be thought of as an additional block in the computation graph:
def compute_cost(y_pred, y):
    m = y.shape[1]

    cost = (-1 / m) * np.sum(np.multiply(y, np.log(y_pred)) + np.multiply(1 - y, np.log(1 - y_pred)))

    cost = np.squeeze(cost)

    return cost
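As a small numerical illustration with arbitrary toy values: predictions close to the true labels give a cost near zero, while confident wrong predictions are penalized heavily.
# Sketch: cross-entropy cost on toy predictions (arbitrary values)
y_true = np.array([[1, 0, 1]])
good_pred = np.array([[0.9, 0.1, 0.8]])
bad_pred = np.array([[0.1, 0.9, 0.2]])

print(compute_cost(good_pred, y_true))  # roughly 0.14
print(compute_cost(bad_pred, y_true))   # roughly 2.07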
Backward Propagation
The backpropagation algorithm is an efficient method of computing gradients for the purpose of minimizing the cost function. In order to enable our neural network to ‘learn’ by minimizing the cost function, we need to alter our parameters. We can find out how we should alter our parameters by computing partial derivatives of the cost function with respect to each of them. In other words, they tell us approximately, for each of our parameters, how the cost function will behave if we increase or decrease that parameter. Partial derivatives are calculated starting from the last layer to the first. This can be visualized with a computation graph.
Calculating these partial derivatives includes applying the chain rule which yields the following expressions:
- Partial derivatives of the cost function with respect to the parameters of the third layer
- Partial derivatives of the cost function with respect to the parameters of the second layer
- Partial derivatives of the cost function with respect to the parameters of the first layer
In order to be able to efficiently calculate partial derivatives for any neural network architecture, we can precalculate a few repeating expressions. For details of how these equations came about, check Deriving And Implementing Backpropagation {% fn 2 %}.
Note: Pay attention to how we chain these equations one into the next.
- Partial derivative of the cost function with respect to the last activation forward prop step (output) \(a^{[L]}\):
\[ \frac{\partial {C} }{\partial a^{[L]}} = - \left(\frac{y}{a^{[L]}}-\frac{1-y}{1-a^{[L]}}\right) \]
def cost_backprop_step(y_pred, y, cost_name):
    if(cost_name == 'entropy'):
        dC_da = - (np.divide(y, y_pred) - np.divide(1 - y, 1 - y_pred))

    return dC_da
- Partial derivative of the cost function with respect to the linear forward prop step is:
\[ \frac{\partial {C} }{\partial z^{[l]}} = \frac{\partial {C} }{\partial a^{[l]}} g'(z^{[l]}) \]
- where \(g'(z^{[l]})\) is the partial derivative of the activation function with respect to the input of the activation function \(z^{[l]}\):
\[ \sigma^{\prime}(z^{[l]}) = \sigma (z^{[l]}) (1-\sigma(z^{[l]}))\]
\[ ReLU'(z^{[l]})=\begin{cases}1, & \text{if}\ z^{[l]}>0 \\ 0, & \text{otherwise}\end{cases} \]
def activation_backward_prop_step(dC_da, z, activation_name):
    if(activation_name == 'sigmoid'):
        dC_dz = dC_da * sigmoid_prime(z)
        assert (dC_dz.shape == z.shape)
        return dC_dz
    elif(activation_name == 'relu'):
        dC_dz = dC_da * relu_prime(z)
        assert (dC_dz.shape == z.shape)
        return dC_dz

def sigmoid_prime(z):
    sigmoid = 1.0 / (1.0 + np.exp(-z))
    return sigmoid * (1.0 - sigmoid)

def relu_prime(z):
    return np.where(z > 0, 1.0, 0.0)
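We can sanity-check these derivative implementations against a centered finite difference. This is a sketch with an arbitrary test point (the ReLU check only holds away from \(z=0\), where ReLU is not differentiable):
# Sketch: compare analytic activation derivatives with finite differences
z_test = np.array([[-1.5, 0.3, 2.0]])
eps = 1e-6

numeric_sigmoid = (sigmoid(z_test + eps) - sigmoid(z_test - eps)) / (2 * eps)
numeric_relu = (relu(z_test + eps) - relu(z_test - eps)) / (2 * eps)

print(np.allclose(numeric_sigmoid, sigmoid_prime(z_test)))  # True
print(np.allclose(numeric_relu, relu_prime(z_test)))        # True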
- Partial derivatives of our cost function with respect to parameters \(w^{[l]}\), \(b^{[l]}\) and the activation \(a^{[l-1]}\):
\[ \frac{\partial {C} }{\partial w^{[l]}} = \frac{1}{m} \frac{\partial {C} }{\partial z^{[l]}} a^{[l-1] T} \]
\[ \frac{\partial {C} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} \frac{\partial {C} }{\partial z^{[l]}}^{(i)}\]
\[ \frac{\partial {C} }{\partial a^{[l-1]}} = {w^{[l]}}^{T} \frac{\partial {C} }{\partial z^{[l]}} \]
def linear_backward_prop_step(dC_dz, a_prev, w):
    m = a_prev.shape[1]

    dC_dw = (1 / m) * np.dot(dC_dz, a_prev.T)
    dC_db = 1. / m * np.sum(dC_dz, axis=1, keepdims=True)
    dC_da_prev = np.dot(w.T, dC_dz)

    assert (dC_da_prev.shape == a_prev.shape)
    assert (dC_dw.shape == w.shape)
    return dC_da_prev, dC_dw, dC_db
Let’s wrap the whole backward step through a single layer as:
def backward_prop_step(dC_da, a_prev, z, w, activation_name):
    dC_dz = activation_backward_prop_step(dC_da, z, activation_name)
    dC_da_prev, dC_dw, dC_db = linear_backward_prop_step(dC_dz, a_prev, w)

    return dC_da_prev, dC_dw, dC_db
Now we can describe the backpropagation of the whole network.
def backward_prop(y_pred, y, cost_name, parameters, forward_prop_cache, activation_names):
    backward_prop_cache = {}

    L = len(parameters) // 2

    dC_daL = cost_backprop_step(y_pred, y, cost_name)
    backward_prop_cache['dC_da' + str(L)] = dC_daL

    for l in reversed(range(1, L + 1)):
        dC_da = backward_prop_cache['dC_da' + str(l)]

        w = parameters['w' + str(l)]

        z = forward_prop_cache['z' + str(l)]
        a_prev = forward_prop_cache['a' + str(l - 1)]

        dC_da_prev, dC_dw, dC_db = backward_prop_step(dC_da, a_prev, z, w, activation_names[l - 1])

        backward_prop_cache['dC_dw' + str(l)] = dC_dw
        backward_prop_cache['dC_db' + str(l)] = dC_db

        if (l != 1):
            backward_prop_cache['dC_da' + str(l - 1)] = dC_da_prev

    return backward_prop_cache
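A standard way to gain confidence in a backpropagation implementation is a numerical gradient check: perturb a single weight, re-run forward propagation, and compare the finite-difference slope of the cost with the analytic gradient. Below is a minimal sketch of such a check on an arbitrary small network (the data, labels, layer sizes and epsilon are illustrative choices):
# Sketch: numerical gradient check of backward_prop on an arbitrary small network
check_X = np.random.randn(4, 5)             # 4 features, 5 samples
check_y = np.random.randint(0, 2, (1, 5))   # random binary labels
check_activations = ['relu', 'sigmoid']

check_parameters = initialize_neural_network([4, 3, 1])
check_cache = forward_prop(check_X, check_parameters, check_activations)
check_grads = backward_prop(check_cache['a2'], check_y, 'entropy',
                            check_parameters, check_cache, check_activations)

eps = 1e-7
w_original = check_parameters['w1'][0, 0]

check_parameters['w1'][0, 0] = w_original + eps
cost_plus = compute_cost(forward_prop(check_X, check_parameters, check_activations)['a2'], check_y)
check_parameters['w1'][0, 0] = w_original - eps
cost_minus = compute_cost(forward_prop(check_X, check_parameters, check_activations)['a2'], check_y)
check_parameters['w1'][0, 0] = w_original

numerical_grad = (cost_plus - cost_minus) / (2 * eps)
analytic_grad = check_grads['dC_dw1'][0, 0]
print(numerical_grad, analytic_grad)  # the two values should agree to several decimal places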
Parameter Update
After all gradients are found, the parameters can be updated by altering their values in the direction that lowers the cost function. Since the gradient shows us how much the function is expected to increase for a unit change of a variable, we subtract the gradient multiplied by a factor called the learning rate. The learning rate decides what fraction of the gradient we apply to the parameters.
\[w^{[l]}=w^{[l]}-\alpha \frac{\partial C}{\partial w^{[l]}}\] \[b^{[l]}=b^{[l]}-\alpha \frac{\partial C}{\partial b^{[l]}}\]
def update_parameters(parameters, backward_prop_cache, learning_rate):
    L = len(parameters) // 2

    for l in range(L):
        parameters["w" + str(l + 1)] = parameters["w" + str(l + 1)] - learning_rate * backward_prop_cache["dC_dw" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * backward_prop_cache["dC_db" + str(l + 1)]

    return parameters
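To see this update rule in isolation, here is a tiny sketch (not the neural network itself) that applies the same rule \(w \leftarrow w-\alpha \frac{\partial C}{\partial w}\) to minimize the one-dimensional function \(C(w)=w^{2}\); the starting point and learning rate are arbitrary:
# Sketch: the same update rule applied to C(w) = w**2
w = 3.0
alpha = 0.1
for _ in range(50):
    dC_dw = 2 * w            # derivative of w**2 with respect to w
    w = w - alpha * dC_dw    # step against the gradient
print(w)                     # close to 0, the minimum of C
Each step moves w against the sign of the gradient, which is exactly what update_parameters does for every weight matrix and bias vector of the network.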
Applying a neural network
In this example a neural network is applied to solve a computer vision binary classification task: recognizing cats in images. The dataset used consists of 259 images, 105 of them cat images and the remaining 154 non-cat images.
Dataset preprocessing
Warning: If you are following this in Google Colab, then you must download the dataset by expanding the next cell and following the instructions!
#collapse-hide
# If you are viewing this on google colab follow these instructions to acquire the dataset:
# - If you don't have a kaggle account make one (it's simple)
# - Go to your profile -> My account -> Scroll to API -> Create New API Token. kaggle.json file should be downloaded now to your computer
# - Set the variable viewing_this_on_colab in this cell to True
# - Execute the cell and when prompted upload your kaggle.json file

viewing_this_on_colab = False

import time
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage

if(viewing_this_on_colab):
    !pip install -q kaggle
    from google.colab import files
    files.upload()
    !mkdir ~/.kaggle/
    !cp kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !kaggle datasets download -d muhammeddalkran/catvnoncat
    !unzip catvnoncat.zip

    train_catvnoncat_path = '/content/catvnoncat/train_catvnoncat.h5'
    test_catvnoncat_path = '/content/catvnoncat/test_catvnoncat.h5'
else:
    train_catvnoncat_path = '../tmp/2020-10-14/data/train_catvnoncat.h5'
    test_catvnoncat_path = '../tmp/2020-10-14/data/test_catvnoncat.h5'
train_dataset = h5py.File(train_catvnoncat_path, "r")
test_dataset = h5py.File(test_catvnoncat_path, "r")

train_x = np.array(train_dataset["train_set_x"][:])
train_y = np.array(train_dataset["train_set_y"][:])
train_y = train_y.reshape((1, train_y.shape[0]))
print("train X shape:{} , train y shape:{}".format(train_x.shape, train_y.shape))

test_x = np.array(test_dataset["test_set_x"][:])
test_y = np.array(test_dataset["test_set_y"][:])
test_y = test_y.reshape((1, test_y.shape[0]))
print("test X shape:{} , test y shape:{}".format(test_x.shape, test_y.shape))

classes = np.array(test_dataset["list_classes"][:])
print('classes: {}'.format(classes))
train X shape:(209, 64, 64, 3) , train y shape:(1, 209)
test X shape:(50, 64, 64, 3) , test y shape:(1, 50)
classes: [b'non-cat' b'cat']
An example of a cat and a non-cat image is shown below:
#collapse-hide
f = plt.figure()
ax1 = f.add_subplot(121)
ax2 = f.add_subplot(122)

index = 2
ax1.imshow(train_x[index])

index = 5
ax2.imshow(train_x[index])

plt.show()
Now we flatten and normalize the pictures for the neural network
train_x_norm = train_x.reshape(train_x.shape[0], -1).T / 255.
test_x_norm = test_x.reshape(test_x.shape[0], -1).T / 255.

print('train_x_norm_shape: {}, test_x_norm_shape: {}'.format(train_x_norm.shape,test_x_norm.shape))
train_x_norm_shape: (12288, 209), test_x_norm_shape: (12288, 50)
Neural network training
When training a neural network, the iteration count needs to be chosen. The iteration count determines how many times the neural network evaluates itself and updates its parameters.
np.random.seed(1)

def NeuralNetworkTrain(X, y, layer_dims, learning_rate, num_iterations, activation_names, cost_name):
    costs = []
    parameters = initialize_neural_network(layer_dims)
    layer_count = len(parameters) // 2

    for i in range(0, num_iterations):
        forward_prop_cache = forward_prop(X, parameters, activation_names)

        y_pred = forward_prop_cache['a' + str(layer_count)]
        cost = compute_cost(y_pred, y)
        backward_prop_cache = backward_prop(y_pred, y, cost_name, parameters, forward_prop_cache,\
                                            activation_names)
        parameters = update_parameters(parameters, backward_prop_cache, learning_rate)

        if i % 100 == 0:
            costs.append(cost)
            print('Iteration:{}, Cost={}'.format(i, cost))

    plt.plot(np.squeeze(costs))
    plt.show()

    return parameters
Now let’s choose our hyperparameters and train our network!
activation_names = ['relu', 'relu', 'relu', 'sigmoid']
layer_dims = [12288, 20, 7, 5, 1]
learning_rate = 0.0075
num_iterations = 3000
cost_name = 'entropy'

parameters = NeuralNetworkTrain(train_x_norm,
                                train_y,
                                layer_dims,
                                learning_rate,
                                num_iterations,
                                activation_names, cost_name)
Iteration:0, Cost=0.7717493284237686
Iteration:100, Cost=0.6720534400822913
Iteration:200, Cost=0.6482632048575212
Iteration:300, Cost=0.6115068816101354
Iteration:400, Cost=0.5670473268366111
Iteration:500, Cost=0.54013766345478
Iteration:600, Cost=0.5279299569455268
Iteration:700, Cost=0.46547737717668525
Iteration:800, Cost=0.36912585249592816
Iteration:900, Cost=0.39174697434805333
Iteration:1000, Cost=0.3151869888600618
Iteration:1100, Cost=0.27269984417893894
Iteration:1200, Cost=0.23741853400268134
Iteration:1300, Cost=0.1996012053220864
Iteration:1400, Cost=0.1892630038846331
Iteration:1500, Cost=0.1611885466582775
Iteration:1600, Cost=0.1482138966236331
Iteration:1700, Cost=0.13777487812972944
Iteration:1800, Cost=0.12974017549190114
Iteration:1900, Cost=0.12122535068005202
Iteration:2000, Cost=0.11382060668633705
Iteration:2100, Cost=0.1078392852625413
Iteration:2200, Cost=0.10285466069352676
Iteration:2300, Cost=0.10089745445261789
Iteration:2400, Cost=0.09287821526472394
Iteration:2500, Cost=0.08841251177615041
Iteration:2600, Cost=0.08595130416146413
Iteration:2700, Cost=0.0816812691492633
Iteration:2800, Cost=0.07824661275815532
Iteration:2900, Cost=0.07544408693855478
Result evaluation
The neural network is then evaluated on the training and test sets by measuring accuracy, recall, precision and the F1 score.
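For reference, these metrics are computed from the counts of true/false positives and negatives. Here is a short sketch with an arbitrary toy confusion matrix (the full evaluation function below gathers these counts from the network's thresholded predictions):
# Sketch: evaluation metrics from an arbitrary toy confusion matrix
TP, FP, FN, TN = 8, 2, 1, 9

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 17 / 20 = 0.85
precision = TP / (TP + FP)                           # 8 / 10 = 0.8
recall = TP / (TP + FN)                              # 8 / 9 ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.84
print(accuracy, precision, recall, f1)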
def print_results(X, y, parameters, activation_names, threshold=0.5):
    m = X.shape[1]
    n = len(parameters) // 2
    p = np.zeros((1, m))

    TP, TN, FP, FN = 0, 0, 0, 0

    forward_prop_cache = forward_prop(X, parameters, activation_names)
    y_pred = forward_prop_cache['a' + str(n)]

    for i in range(0, y_pred.shape[1]):
        if y_pred[0, i] > threshold:
            p[0, i] = 1
            if(y[0, i] == 1):
                TP = TP + 1
            else:
                FP = FP + 1
        else:
            p[0, i] = 0
            if(y[0, i] == 0):
                TN = TN + 1
            else:
                FN = FN + 1

    print('question is: is it a cat?')
    print()
    print('                 ACTUALLY TRUE      ACTUALLY FALSE')
    print('PREDICTED TRUE    TP: {}              FP: {}'.format(TP, FP))
    print('PREDICTED FALSE   FN: {}              TN: {}'.format(FN, TN))
    print()
    print('accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
    print('precision: {}'.format(TP / (TP + FP)))
    print('recall: {}'.format(TP / (TP + FN)))
    print('f1: {}'.format((TP / ((TP + ((FN + FP) / 2))))))
    print()

print('-----TRAINING SET RESULTS-----')
print_results(train_x_norm, train_y, parameters, activation_names=['relu', 'relu', 'relu', 'sigmoid'])

print('-----TEST SET RESULTS-----')
print_results(test_x_norm, test_y, parameters, activation_names=['relu', 'relu', 'relu', 'sigmoid'])
-----TRAINING SET RESULTS-----
question is: is it a cat?
ACTUALLY TRUE ACTUALLY FALSE
PREDICTED TRUE TP: 70 FP: 0
PREDICTED FALSE FN: 2 TN: 137
accuracy: 0.9904306220095693
precision: 1.0
recall: 0.9722222222222222
f1: 0.9859154929577465
-----TEST SET RESULTS-----
question is: is it a cat?
ACTUALLY TRUE ACTUALLY FALSE
PREDICTED TRUE TP: 31 FP: 7
PREDICTED FALSE FN: 2 TN: 10
accuracy: 0.82
precision: 0.8157894736842105
recall: 0.9393939393939394
f1: 0.8732394366197183
Let’s predict the class of a random picture and show the result:
def predict(x, parameters, activation_names, threshold=0.5):
    n = len(parameters) // 2
    forward_prop_cache = forward_prop(x, parameters, activation_names)
    y_pred = forward_prop_cache['a' + str(n)]
    print(y_pred)
#collapse-hide
index = np.random.choice(range(50))
img = test_x[index, :]
plt.imshow(img)
#collapse-hide
img_flatten = test_x_norm[:, index]
img_flatten = np.expand_dims(img_flatten, axis=1)

print('Predicted percentage of it being a cat:')
predict(img_flatten, parameters, activation_names)
Predicted percentage of it being a cat:
[[0.9999584]]
Thank you for reading.