Single layer neural network

Logistic regression is a supervised learning algorithm that applies a sigmoid or softmax activation function on the output layer, depending on whether the task is binary or multi-class classification. This article assumes binary classification, where the sigmoid activation outputs a probability between 0 and 1. The data is structured as a matrix X holding n features for each of m training examples and a matrix Y holding the m labels, so X has a shape of (n, m) in Python. The algorithm optimizes its parameters using the cross-entropy loss function, which is generally preferred over mean squared error. The optimization is performed through gradient descent, iteratively updating the weights and bias using derivatives of the cost function. Forward and backward propagation are used to compute the gradients needed for these updates. The article details the complete process of deriving and applying these derivatives, focusing on the derivative of the activation function and on the computation graph for a single-layer neural network.

Let's say we have m training examples, each with n features. The whole data set can be written in matrix form as

$$X= \begin{bmatrix} x^1_1 & x^2_1 & \cdots & x^m_1 \\ x^1_2 & x^2_2 & \cdots & x^m_2 \\ \vdots &\vdots & \ddots & \vdots\\ x^1_n & x^2_n & \cdots & x^m_n \end{bmatrix} , \quad Y=\begin{bmatrix} y^1 & y^2 & \cdots & y^m \end{bmatrix}$$

Here the shape of the matrix X in Python will be X.shape = (n, m) and the shape of Y will be Y.shape = (1, m).
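To make the layout concrete, here is a small sketch that builds arrays with these shapes. It assumes NumPy is the array library in use, and the sizes n = 3 and m = 4 and the random values are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy data: n = 3 features, m = 4 training examples.
n, m = 3, 4
rng = np.random.default_rng(0)

# Each column of X is one training example; each row is one feature.
X = rng.normal(size=(n, m))
# Y holds one binary label per example, stored as a row vector.
Y = rng.integers(0, 2, size=(1, m))

print(X.shape)  # (3, 4)
print(Y.shape)  # (1, 4)

# The first training example is the first column of X.
x_first = X[:, 0]
print(x_first.shape)  # (3,)
```

Storing examples as columns is what lets a single matrix product compute z for every example at once.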

The prediction in logistic regression is defined through the sigmoid function as

$$\begin{flalign*} &\hat{y}= \sigma (Z)=\sigma(W^Tx+b) &\\ \end{flalign*}$$

where Z is the linear part of the model:

$$\begin{flalign*} &Z=W^Tx+b &\\ \end{flalign*}$$

where W is the weight matrix and b is the bias term. For the ith example we can write this as

$$\begin{flalign*} &\hat{y}^i=\sigma(W^Tx^i+b) &\\ \end{flalign*}$$
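The prediction step can be sketched in a few lines of NumPy. The function names `sigmoid` and `predict_proba`, the column-vector shape (n, 1) for W, and the parameter values are assumptions chosen for this illustration, not something fixed by the article.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(W, b, X):
    """Return y_hat = sigmoid(W^T X + b) for every column of X."""
    Z = W.T @ X + b          # linear part, shape (1, m)
    return sigmoid(Z)

# Hypothetical parameters and inputs: n = 2 features, m = 3 examples.
W = np.array([[0.5], [-0.25]])          # shape (n, 1)
b = 0.1
X = np.array([[1.0, 2.0, 0.0],
              [4.0, 0.0, 0.0]])         # shape (n, m)

y_hat = predict_proba(W, b, X)
print(y_hat.shape)   # (1, 3)
print(y_hat)         # every entry lies strictly between 0 and 1
```

The third example is all zeros, so its prediction depends only on the bias: sigmoid(0.1).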

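The loss function used here is cross-entropy, which is generally preferred over mean squared error (MSE) for logistic regression because, combined with the sigmoid output, it gives a cost function that is convex in W and b. It is given as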

$$\begin{flalign*} &L(\hat{y},y)=-(y \log\hat{y} +(1-y)\log(1-\hat{y})) \tag{1} &\\ \end{flalign*}$$
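A short sketch shows how this loss behaves for a single example. The function name `cross_entropy_loss` and the three sample predictions are assumptions for illustration; the logarithm is the natural logarithm.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Loss for one example: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The true label is 1 in every case; only the prediction changes.
confident_right = cross_entropy_loss(0.9, 1)
unsure = cross_entropy_loss(0.5, 1)
confident_wrong = cross_entropy_loss(0.1, 1)

# Roughly 0.105, 0.693 and 2.303: the loss grows as the prediction
# moves away from the true label.
print(confident_right, unsure, confident_wrong)
```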

The loss function measures how far the model's prediction is from the true label: the lower the loss, the better the prediction.

The loss function is calculated for each example, and the cost function is the average of the losses over all m training examples. The cost function is defined as

$$J(w,b)=\frac{1}{m} \sum_{i=1}^m L(\hat{y}^i,y^i)=-\frac{1}{m}\sum_{i=1}^m \left( y^i \log\hat{y}^i +(1-y^i)\log(1-\hat{y}^i)\right) \tag{2}$$
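The averaging in the cost function can be sketched as follows. The predictions in `A` and the labels in `Y` are made-up values, and both are assumed to be stored as (1, m) row vectors.

```python
import numpy as np

def cost(A, Y):
    """Average cross-entropy over m examples. A and Y have shape (1, m)."""
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # one loss per example
    return float(np.sum(losses) / m)

# Hypothetical predictions and labels for m = 4 examples.
A = np.array([[0.9, 0.2, 0.7, 0.4]])
Y = np.array([[1, 0, 1, 0]])

J = cost(A, Y)
print(J)
```

For each example only one of the two terms is active: the first when the label is 1, the second when the label is 0.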

Gradient descent adjusts the weights and bias step by step so that a minimum is reached on the curve of the cost function plotted against the weights. In other words, we adjust the weights and bias so that the cost function is minimized. The derivatives of the cost function with respect to W and b are calculated, and the parameters are then updated as part of backpropagation. This can be depicted as

$$\text{repeat} \begin{Bmatrix} w=w-\alpha\frac{dJ(w,b)}{dw}\\ b=b-\alpha\frac{dJ(w,b)}{db} \end{Bmatrix}$$
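The update rule itself is independent of the model, so it can be illustrated on a toy problem. The sketch below assumes a one-parameter cost J(w) = (w - 3)^2 that has nothing to do with logistic regression; it is used only because its minimum at w = 3 is easy to verify. The function name `gradient_descent` and the values of `alpha` and `steps` are also assumptions.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeat w = w - alpha * dJ/dw for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2, whose derivative is dJ/dw = 2 * (w - 3).
# Its minimum is at w = 3.
w_final = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)  # very close to 3
```

Each step moves w against the sign of the derivative, so the distance to the minimum shrinks as long as the learning rate is small enough.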

Deriving the derivatives for logistic regression

Let's say we have a dataset where each training example has two features, so that

$$\begin{flalign*} &z=(w_1x_1+w_2x_2+b) &\\ \end{flalign*}$$

where $w_1$ and $w_2$ are the weights, $x_1$ and $x_2$ are the features of a training example, and $b$ is the bias.

The following is a computation graph for a one-layer neural network. We will work through the steps for forward propagation and backward propagation.

Figure 1: Single-layer neural network

The layer performs two steps in the forward pass: it first calculates the linear function z and then applies the activation function to get a. For logistic regression the sigmoid activation is used because we want the output to be a probability between 0 and 1 that can be mapped to a prediction of either 0 or 1.

From the chain rule for derivatives,

$$\begin{flalign*} & \frac{dL(a,y)}{dz}= \frac{dL(a,y)}{da}\cdot \frac{da}{dz} \tag{3} &\\ \end{flalign*}$$

Let's first calculate the derivative of the loss with respect to the activation $a$, where $a=\hat{y}$ is the output of the sigmoid:

$$\begin{flalign*} & \frac{dL}{da}=\frac{d}{da}\left(-(y \log a +(1-y)\log(1-a))\right) \tag{4} &\\ \end{flalign*}$$

Take the first term, $-y \log\hat{y}$ or $-y \log a$. Its derivative with respect to $a$ is $-\frac{y}{a}$.

The derivative of the second term, $-(1-y)\log(1-a)$, is $\frac{1-y}{1-a}$, because the inner derivative of $1-a$ contributes a factor of $-1$. Together, the whole equation is written as

$$\begin{flalign*} & \frac{dL}{da}=-\frac{y}{a}+\frac{1-y}{1-a} \tag{5} &\\ \end{flalign*}$$

Putting both fractions over the common denominator $a(1-a)$, this simplifies to $\frac{dL}{da}=\frac{a-y}{a(1-a)}$
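As a quick sanity check with assumed values $y=1$ and $a=0.8$ (picked only for illustration), both forms give the same number:

$$\begin{flalign*} & -\frac{y}{a}+\frac{1-y}{1-a}=-\frac{1}{0.8}+\frac{0}{0.2}=-1.25, \qquad \frac{a-y}{a(1-a)}=\frac{0.8-1}{0.8\cdot 0.2}=\frac{-0.2}{0.16}=-1.25 &\\ \end{flalign*}$$

The negative sign says that increasing $a$ towards the true label 1 reduces the loss.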

Now let's calculate $\frac{da}{dz}$, or $\frac{d}{dz}\sigma(z)$.

$\frac{d}{dz}\sigma(z)=\frac{d}{dz}\left( \frac{1}{1+e^{-z}}\right)$, which can be rewritten as $\frac{d}{dz}\sigma(z)=\frac{d}{dz}( {1+e^{-z}})^{-1}$

This is broken down by applying the chain rule:

$$\begin{flalign*} & \frac{d}{dz}\sigma(z)= -1\cdot( {1+e^{-z}})^{-2} \cdot \frac{d}{dz}( {e^{-z}}) \tag{6} &\\ \end{flalign*}$$

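$\frac{d}{dz}( {e^{-z}})$ is further solved using the rule for the derivative of an exponential with a constant base $c$, which is defined as

$\frac{d}{dz}c^{f(z)}=c^{f(z)}\cdot\log{c}\cdot f^{'}(z)$

With $c=e$ and $f(z)=-z$, this gives $\frac{d}{dz}( {e^{-z}})=e^{-z}\cdot\log{e}\cdot(-1)=e^{-z}\cdot 1\cdot(-1)=-e^{-z}$, where $\log$ is the natural logarithm.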

Substituting this into equation 6 gives $\frac{da}{dz}= -1\cdot({1+e^{-z}})^{-2}\cdot(-e^{-z})=\frac{e^{-z}}{(1+e^{-z})^{2}}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)=a(1-a)$
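One way to gain confidence in the result a(1 - a) is to compare it with a numerical derivative. The sketch below uses a central difference with an assumed step size h = 1e-5; the helper names and the test points are chosen only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: da/dz = a * (1 - a)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def central_difference(f, z, h=1e-5):
    """Numerical derivative (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 1.5):
    analytic = sigmoid_grad(z)
    numeric = central_difference(sigmoid, z)
    print(z, analytic, numeric)  # the two values agree to many decimal places
```

At z = 0 the activation is 0.5, so the derivative reaches its largest value, 0.25.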

Now let's complete equation 3 by substituting the components derived above into $\frac{dL}{dz}=\frac{dL}{da}\cdot\frac{da}{dz}$:

$$\begin{flalign*} & \frac{dL}{dz}=\frac{a-y}{a(1-a)}\cdot a(1-a)=a-y \tag{7} &\\ \end{flalign*}$$
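The cancellation can also be checked numerically by multiplying the two factors and comparing the product with a - y. The function names and the sample values of z and y below are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dL_dz_chain(z, y):
    """Multiply the two chain-rule factors explicitly."""
    a = sigmoid(z)
    dL_da = -y / a + (1 - y) / (1 - a)   # derivative of the loss w.r.t. a
    da_dz = a * (1 - a)                  # derivative of the sigmoid
    return dL_da * da_dz

def dL_dz_closed(z, y):
    """Simplified form: a - y."""
    return sigmoid(z) - y

for z, y in ((0.3, 1), (-1.2, 0), (2.0, 1)):
    print(dL_dz_chain(z, y), dL_dz_closed(z, y))  # the pairs match
```

In practice the simplified form is the one to use, since it avoids dividing by a or 1 - a when the activation is very close to 0 or 1.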

Now we will calculate the gradient of the loss with respect to the weights $w_1,w_2$ and the bias $b$.

$\frac{dL}{dw_1}=\frac{dL}{dz}\cdot\frac{dz}{dw_1}$ (arrow 2 in Figure 1). Since $z=w_1x_1+w_2x_2+b$, $\frac{dL}{dw_1}=\frac{dL}{dz}\cdot\frac {d}{dw_1}(w_1x_1+w_2x_2+b)=x_1\frac{dL}{dz}$

Similarly,

$\frac{dL}{dw_2}=\frac{dL}{dz}\cdot\frac {d}{dw_2}(w_1x_1+w_2x_2+b)=x_2\frac{dL}{dz}$ and $\frac{dL}{db}=\frac{dL}{dz}\cdot\frac {dz}{db}=\frac{dL}{dz}$

To summarize,

$$\begin{flalign*} & \frac{dL}{dz}=a-y &\\ & \frac{dL}{dw_1}=x_1\frac{dL}{dz} &\\ & \frac{dL}{dw_2}=x_2\frac{dL}{dz} &\\ & \frac{dL}{db}=\frac{dL}{dz} &\\ \end{flalign*}$$
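These four formulas translate almost directly into code for a single training example. The sketch below is a minimal version under assumed parameter and feature values, and it compares the analytic gradient for w1 with a finite-difference estimate as a check. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Forward pass for one example, returning the cross-entropy loss."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def gradients(w1, w2, b, x1, x2, y):
    """Backward pass for one example using the summary formulas."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y
    return {"dw1": x1 * dz, "dw2": x2 * dz, "db": dz}

# Hypothetical single training example and parameter values.
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, 2.0, 1

grads = gradients(w1, w2, b, x1, x2, y)

# Finite-difference estimate of dL/dw1 for comparison.
h = 1e-6
numeric_dw1 = (loss(w1 + h, w2, b, x1, x2, y)
               - loss(w1 - h, w2, b, x1, x2, y)) / (2 * h)
print(grads["dw1"], numeric_dw1)  # the two values agree closely
```

Note that only one forward pass is needed to get all three gradients, because they all reuse the same value of dL/dz.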

Now the weights and bias can be adjusted as

$$\begin{flalign*} &w_1=w_1-\alpha\frac{dL}{dw_1} &\\ & w_2=w_2-\alpha\frac{dL}{dw_2}&\\ & b=b-\alpha\frac{dL}{db}&\\ \end{flalign*}$$

where $\alpha$ is the learning rate. The cost function is the average of the loss function over all m training examples:

$J(w,b)=\frac{1}{m} \sum_{i=1}^{m} L(a^i,y^i)$

The gradient of the loss is calculated for each training example, then summed and averaged to obtain the gradient of the cost function over the m training examples.

$\frac{d}{dw_1}J(w,b)=\frac{1}{m} \sum_{i=1}^{m} \frac{d}{dw_1}L(a^i,y^i)$

$\frac{d}{dw_2}J(w,b)=\frac{1}{m} \sum_{i=1}^{m} \frac{d}{dw_2}L(a^i,y^i)$

$\frac{d}{db}J(w,b)=\frac{1}{m} \sum_{i=1}^{m} \frac{d}{db}L(a^i,y^i)$
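Putting the pieces together, the averaged gradients and the update rule can be sketched as a small training loop. This is one possible vectorized arrangement in NumPy, not a definitive implementation: the function names, the zero initialization, the learning rate, the number of steps, and the synthetic data set are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(W, b, X, Y):
    """One forward and backward pass over all m examples.

    X has shape (n, m), Y has shape (1, m), W has shape (n, 1).
    """
    m = X.shape[1]
    A = sigmoid(W.T @ X + b)                 # forward pass, shape (1, m)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                               # dL/dz for every example
    dW = (X @ dZ.T) / m                      # average of x_j * dL/dz, shape (n, 1)
    db = np.sum(dZ) / m                      # average of dL/dz
    return float(J), dW, float(db)

def train(X, Y, alpha=0.1, steps=500):
    """Plain gradient descent starting from zero weights and bias."""
    n = X.shape[0]
    W, b = np.zeros((n, 1)), 0.0
    history = []
    for _ in range(steps):
        J, dW, db = cost_and_gradients(W, b, X, Y)
        history.append(J)
        W = W - alpha * dW
        b = b - alpha * db
    return W, b, history

# Hypothetical toy data: the label is 1 when the two features sum to
# a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

W, b, history = train(X, Y)
print(history[0], history[-1])  # the cost starts at log(2) and decreases
```

With all parameters at zero every prediction is 0.5, which is why the first recorded cost equals log(2), about 0.693.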