Deep Neural Network
Figure 1: neural network example
each layer can be described mathematically as below; computing through the layers up to the final layer is what we call forward propagation:
\begin{align}
\label{eq:3}
z^{(l)} &= W^{(l)}a^{(l-1)} + b^{(l)} && (\text{if } l > 1) \\
a^{(l)} &= \begin{cases}
\sigma(z^{(l)}) & \text{if } l > 1 \\
x & \text{if } l = 1
\end{cases}
\end{align}
which is composed of a linear part \(z^{(l)}\) and an activation part \(a^{(l)}\); if we use the sigmoid as the activation function, \(\sigma(z^{(l)}) = \frac{1}{1 + e^{-z^{(l)}}}\).
for example, the computation of layer 2 (3 inputs, 4 units) can be described as:
\begin{align}
\label{eq:4}
z^{(2)} &= W^{(2)}x + b^{(2)} \\
&= \begin{bmatrix}
w_{1,1}^{(2)} & w_{1,2}^{(2)} & w_{1,3}^{(2)} \\
w_{2,1}^{(2)} & w_{2,2}^{(2)} & w_{2,3}^{(2)} \\
w_{3,1}^{(2)} & w_{3,2}^{(2)} & w_{3,3}^{(2)} \\
w_{4,1}^{(2)} & w_{4,2}^{(2)} & w_{4,3}^{(2)} \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1^{(2)} \\
b_2^{(2)} \\
b_3^{(2)} \\
b_4^{(2)}
\end{bmatrix} \\
\begin{bmatrix}
z_{1}^{(2)} \\
z_{2}^{(2)} \\
z_{3}^{(2)} \\
z_{4}^{(2)}
\end{bmatrix}
&= \begin{bmatrix}
w_{1,1}^{(2)} x_1 + w_{1,2}^{(2)} x_2 + w_{1,3}^{(2)} x_3 + b_1^{(2)} \\
w_{2,1}^{(2)} x_1 + w_{2,2}^{(2)} x_2 + w_{2,3}^{(2)} x_3 + b_2^{(2)} \\
w_{3,1}^{(2)} x_1 + w_{3,2}^{(2)} x_2 + w_{3,3}^{(2)} x_3 + b_3^{(2)} \\
w_{4,1}^{(2)} x_1 + w_{4,2}^{(2)} x_2 + w_{4,3}^{(2)} x_3 + b_4^{(2)} \\
\end{bmatrix} \\
a^{(2)} &= \sigma(z^{(2)}) \\
&= \begin{bmatrix}
\sigma(z_{1}^{(2)}) \\
\sigma(z_{2}^{(2)}) \\
\sigma(z_{3}^{(2)}) \\
\sigma(z_{4}^{(2)})
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{1}{1 + e^{-z_{1}^{(2)}}} \\
\frac{1}{1 + e^{-z_{2}^{(2)}}} \\
\frac{1}{1 + e^{-z_{3}^{(2)}}} \\
\frac{1}{1 + e^{-z_{4}^{(2)}}}
\end{bmatrix}
\end{align}
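as an illustration (a minimal numpy sketch, with shapes taken from the matrices above and variable names of my own choosing), the layer-2 computation is just a matrix-vector product, a vector addition, and an elementwise sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# layer 2 of the example network: 3 inputs -> 4 units
W2 = np.random.randn(4, 3)   # weight matrix W^(2), shape (4, 3)
b2 = np.random.randn(4, 1)   # bias vector b^(2), shape (4, 1)
x  = np.random.randn(3, 1)   # input a^(1) = x, shape (3, 1)

z2 = W2 @ x + b2             # linear part z^(2), shape (4, 1)
a2 = sigmoid(z2)             # activation part a^(2) = sigma(z^(2))
```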
loss (cost) function
choose a loss (cost) function, for example MSE (mean squared error), written here for a single training example:
\begin{align}
\label{eq:5}
C &= \frac{1}{2} \| a^{(L)} - y \|_{2}^2
\end{align}
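this is half the squared \(\ell_2\) distance between the prediction and the target; a one-line numpy sketch (names are illustrative):

```python
import numpy as np

def mse_loss(a_L, y):
    # C = 1/2 * || a^(L) - y ||_2^2
    return 0.5 * np.sum((a_L - y) ** 2)
```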
gradient descent
we use gradient descent to find the optimal parameters \(\theta\) (the weights \(W^{(l)}\) and biases \(b^{(l)}\) are all part of \(\theta\)) that minimize the loss:
\[
\theta^{*} = \arg\min_{\theta} C(\theta)
\]
To make \(C\) decrease, note that for a small parameter change \(\Delta \theta\) in one iteration, the resulting change in the loss is approximately:
\[
\Delta C \approx \nabla C \cdot \Delta \theta
\]
update \(\theta\) in the opposite direction of \(\nabla C\) (the gradient of \(C\), which points in the direction of fastest increase of \(C\)) with a learning rate \(\eta\); choosing \(\Delta \theta = -\eta \nabla C\) gives \(\Delta C \approx -\eta \| \nabla C \|^2 \le 0\), so the loss decreases:
\begin{align}
\label{eq:10}
\theta &= \theta + \Delta \theta \\
&= \theta - \eta \nabla C
\end{align}
Computing \(\nabla C\) is all we need, and we use backward propagation to do it.
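a tiny sketch of one update step, using a toy one-parameter loss \(C(\theta) = \theta^2\) (so \(\nabla C = 2\theta\)) just to show the rule in action; \(\eta\) and the function name are illustrative:

```python
def gradient_descent_step(theta, grad_C, eta=0.1):
    # move theta a small step against the gradient of C
    return theta - eta * grad_C

# toy example: minimize C(theta) = theta^2, so grad C = 2 * theta
theta = 5.0
for _ in range(100):
    theta = gradient_descent_step(theta, 2 * theta)
print(theta)  # close to 0, the minimizer of theta^2
```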
Forward Propagation
forward propagation computes \(z^{(l)}\) and \(a^{(l)}\) for each layer sequentially from layer 1 to layer \(L\):
- start with the input: \(a^{(1)} = x\)
- for each layer \(l = 2, 3, \ldots, L\):
- compute the linear combination: \(z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}\)
- apply the activation: \(a^{(l)} = \sigma(z^{(l)})\)
- the final output is \(a^{(L)}\), then compute the loss \(C = \frac{1}{2} \| a^{(L)} - y \|_2^2\)
in other words, forward propagation is the process of evaluating the composite function that represents the neural network, feeding input data through the network layer by layer until a prediction is produced.
Figure 2: forward propagation: data flows from input to output, each layer applies a linear transform followed by an activation function
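as a concrete sketch of this loop (not part of the original derivation), a minimal numpy version of the forward pass; the 3-4-2 layer sizes match the example network above, and the dict-of-matrices layout and variable names are just one possible choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, L):
    """forward propagation: a[1] = x, then z[l] = W[l] a[l-1] + b[l], a[l] = sigma(z[l])."""
    a, z = {1: x}, {}
    for l in range(2, L + 1):
        z[l] = W[l] @ a[l - 1] + b[l]   # linear part
        a[l] = sigmoid(z[l])            # activation part
    return z, a

# the 3-4-2 example network (L = 3)
rng = np.random.default_rng(0)
W = {2: rng.standard_normal((4, 3)), 3: rng.standard_normal((2, 4))}
b = {2: rng.standard_normal((4, 1)), 3: rng.standard_normal((2, 1))}
x = rng.standard_normal((3, 1))   # input a^(1)
y = np.array([[0.1], [0.9]])      # target, kept in (0, 1) since the output is a sigmoid

z, a = forward(x, W, b, L=3)
C = 0.5 * np.sum((a[3] - y) ** 2)  # the MSE loss from the previous section
```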
Backward Propagation
after forward propagation produces a prediction \(a^{(L)}\) and the loss \(C\), backward propagation computes how much each parameter contributed to that loss; we need the gradients with respect to all parameters: \(\frac{\partial C}{\partial W^{(l)}}\) and \(\frac{\partial C}{\partial b^{(l)}}\).
define the error signal (also called delta) at layer \(l\) as:
\begin{align}
\label{eq:8}
\delta^{(l)} &= \frac{\partial C}{\partial z^{(l)}} \\
&= \frac{\partial C}{\partial a^{(l)}} \odot \frac{\partial a^{(l)}}{\partial z^{(l)}} \\
&= \begin{cases}
(a^{(L)} - y) \odot \sigma'(z^{(L)}) & \text{if } l = L \\
((\frac{\partial z^{(l+1)}}{\partial a^{(l)}})^{T} \frac{\partial C}{\partial z^{(l+1)}}) \odot \sigma'(z^{(l)}) & \text{if } l < L \\
\end{cases} \\
&= \begin{cases}
(a^{(L)} - y) \odot \sigma'(z^{(L)}) & \text{if } l = L \\
((W^{(l+1)})^{T}\delta^{(l+1)}) \odot \sigma'(z^{(l)}) & \text{if } l < L \\
\end{cases} \\
\frac{\partial C}{\partial W^{(l)}} &= \frac{\partial C}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial W^{(l)}} \\
&= \delta^{(l)}(a^{(l-1)})^{T} \\
\frac{\partial C}{\partial b^{(l)}} &= \frac{\partial C}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial b^{(l)}} \\
&= \delta^{(l)}
\end{align}
Figure 3: backward propagation: error signals flow from output layer back to input, computing gradients for each layer along the way
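a matching numpy sketch of these formulas, assuming the sigmoid and forward helpers from the forward-propagation sketch above are already defined; the dict layout and names are illustrative:

```python
def sigmoid_prime(z):
    # derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(y, z, a, W, L):
    """backward propagation: delta[l], dC/dW[l], dC/db[l] for l = L, ..., 2."""
    delta, dW, db = {}, {}, {}
    for l in range(L, 1, -1):
        if l == L:
            delta[l] = (a[L] - y) * sigmoid_prime(z[L])                   # output-layer error
        else:
            delta[l] = (W[l + 1].T @ delta[l + 1]) * sigmoid_prime(z[l])  # propagate error backwards
        dW[l] = delta[l] @ a[l - 1].T   # dC/dW^(l) = delta^(l) (a^(l-1))^T
        db[l] = delta[l]                # dC/db^(l) = delta^(l)
    return dW, db
```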
take the same network as before (\(L = 3\): 3 inputs, 4 units in layer 2, and 2 outputs) as an example:
\begin{align}
\label{eq:11}
\delta^{(l)} &= \begin{cases}
(a^{(L)} - y) \odot \sigma'(z^{(L)}) & \text{if } l = L \\
((W^{(l+1)})^{T}\delta^{(l+1)}) \odot \sigma'(z^{(l)}) & \text{if } l < L
\end{cases} \\
&= \begin{cases}
\begin{bmatrix}
a_{1}^{(3)} - y_{1} \\
a_{2}^{(3)} - y_{2}
\end{bmatrix} \odot \begin{bmatrix}
\sigma'(z_{1}^{(3)}) \\
\sigma'(z_{2}^{(3)})
\end{bmatrix} & \text{if } l = L = 3 \\
(\begin{bmatrix}
w_{1,1}^{(3)} & w_{2,1}^{(3)} \\
w_{1,2}^{(3)} & w_{2,2}^{(3)} \\
w_{1,3}^{(3)} & w_{2,3}^{(3)} \\
w_{1,4}^{(3)} & w_{2,4}^{(3)}
\end{bmatrix}
\begin{bmatrix}
\delta_{1}^{(3)} \\
\delta_{2}^{(3)}
\end{bmatrix}
) \odot \begin{bmatrix}
\sigma'(z_{1}^{(2)}) \\
\sigma'(z_{2}^{(2)}) \\
\sigma'(z_{3}^{(2)}) \\
\sigma'(z_{4}^{(2)})
\end{bmatrix} & \text{if } l = 2 \\
\end{cases} \\
\frac{\partial C}{\partial W^{(l)}} &= \delta^{(l)} (a^{(l-1)})^{T} \\
&= \begin{cases}
\begin{bmatrix}
\delta_{1}^{(3)} \\
\delta_{2}^{(3)}
\end{bmatrix} \begin{bmatrix}
a_{1}^{(2)} & a_{2}^{(2)} & a_{3}^{(2)} & a_{4}^{(2)}
\end{bmatrix} & \text{if } l = 3 \\
\begin{bmatrix}
\delta_{1}^{(2)} \\
\delta_{2}^{(2)} \\
\delta_{3}^{(2)} \\
\delta_{4}^{(2)}
\end{bmatrix} \begin{bmatrix}
x_{1} & x_{2} &x_{3}
\end{bmatrix} & \text{if } l = 2
\end{cases} \\
\frac{\partial C}{\partial b^{(l)}} &= \delta^{(l)} \\
&= \begin{cases}
\begin{bmatrix}
\delta_{1}^{(3)} \\
\delta_{2}^{(3)}
\end{bmatrix} & \text{if } l = 3 \\
\begin{bmatrix}
\delta_{1}^{(2)} \\
\delta_{2}^{(2)} \\
\delta_{3}^{(2)} \\
\delta_{4}^{(2)}
\end{bmatrix} & \text{if } l = 2
\end{cases}
\end{align}
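as a quick sanity check of this \(L = 3\) example, running the two sketches above on the 3-4-2 network (reusing x, y, W, b from the forward sketch) gives gradients with exactly the shapes written out here:

```python
z, a = forward(x, W, b, L=3)
dW, db = backward(y, z, a, W, L=3)
print(dW[3].shape, db[3].shape)   # (2, 4) (2, 1): delta^(3) (a^(2))^T and delta^(3)
print(dW[2].shape, db[2].shape)   # (4, 3) (4, 1): delta^(2) x^T and delta^(2)
```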
the full training loop
with all gradients computed by backward propagation, the complete gradient descent algorithm is:
- initialize all weights \(W^{(l)}\) and biases \(b^{(l)}\) with small random values
- repeat until convergence (or for a fixed number of epochs):
- forward pass: compute \(a^{(l)}\) for each layer \(l = 1 \ldots L\), then compute the loss \(C\)
- backward pass: compute \(\delta^{(l)}\) for each layer \(l = L, L-1, \ldots, 2\) using the chain rule
- compute gradients: \(\frac{\partial C}{\partial W^{(l)}} = \delta^{(l)}(a^{(l-1)})^{T}\), \(\frac{\partial C}{\partial b^{(l)}} = \delta^{(l)}\)
- update parameters: for each layer, \(W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial C}{\partial W^{(l)}}\) and \(b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial C}{\partial b^{(l)}}\)
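putting the sketches above together, a minimal training loop might look like this (single training example, fixed learning rate; all names come from the earlier sketches and are illustrative, not a reference implementation):

```python
def train(x, y, W, b, L, eta=0.5, epochs=1000):
    """repeat: forward pass, backward pass, gradient descent update."""
    for _ in range(epochs):
        z, a = forward(x, W, b, L)            # forward pass
        dW, db = backward(y, z, a, W, L)      # backward pass
        for l in range(2, L + 1):             # parameter update for every layer
            W[l] = W[l] - eta * dW[l]
            b[l] = b[l] - eta * db[l]
    return W, b

W, b = train(x, y, W, b, L=3)
z, a = forward(x, W, b, L=3)
print(0.5 * np.sum((a[3] - y) ** 2))   # loss is far smaller than before training
```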