
A neural network is trained through back-propagation, which adapts the weights of the layers successively. Moving from the perceptron's step function to a continuous, differentiable activation function is what makes this possible.
The idea is to start from the end of the network and evaluate the difference between the output of the network and the target from the training set.
We use this difference as the error for the last layer. We then propagate the error back to the last hidden layer and adapt the weights between the last and second-to-last layers accordingly.

The last layer is slightly different from the others, so we treat it first.
Let's look at the last set of parameters, between the output layer and the last hidden layer.
We have the loss function
$$ J = -\sum_{i,j} y^{(i)}_{j} \log(\hat y^{(i)}_{j})$$where $y^{(i)}_{j}$ is 1 if training sample $i$ is in class $j$, 0 otherwise.
$\hat y^{(i)}_{j}$ is the prediction of the model, here
$$ \hat y^{(i)}_{j} = s_j(z^{(i)} )$$with $z_j^{(i)}$ the linear combination of the last hidden layer with the last set of coefficients $w_{jk}$, and $s$ the softmax function, $s_j(z^{(i)}) = e^{z^{(i)}_j} / \sum_{l} e^{z^{(i)}_l}$.
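As a concrete sketch, the loss $J$ and the softmax prediction can be written in a few lines of numpy (the array names and example values here are made up):

```python
import numpy as np

def softmax(z):
    """s_j(z) = exp(z_j) / sum_l exp(z_l); shift by the max for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y, Y_hat):
    """J = -sum_{i,j} y_ij * log(y_hat_ij) for one-hot targets Y."""
    return -np.sum(Y * np.log(Y_hat))

# two made-up samples, three classes
Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.0]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Y_hat = softmax(Z)   # each row sums to 1
J = cross_entropy(Y, Y_hat)
```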
We can calculate the derivative of the loss function with respect to the linear combination in the last layer:
$$ \frac{\partial J}{\partial z_j^{(i)}} = \sum\limits_{l=1}^{n_o}\frac{\partial J}{\partial \hat y^{(i)}_{l}} \frac{\partial \hat y^{(i)}_{l}}{\partial z_j^{(i)}} = \sum\limits_{l=1}^{n_o} \left( -\frac{y^{(i)}_{l} }{ \hat y^{(i)}_{l} }\right) s_l\left(\delta_{lj}-s_j\right) $$with $n_o$ the number of units in the output layer.
$$ \frac{\partial J}{\partial w_{km}} = \sum\limits_{i=1}^{n_d}\sum\limits_{j=1}^{n_o}\frac{\partial J}{\partial z_j^{(i)}}\frac{\partial z_j^{(i)}}{\partial w_{km}} = \sum\limits_{i=1}^{n_d}\sum\limits_{j=1}^{n_o} \sum\limits_{l=1}^{n_o} \left( -\frac{y^{(i)}_{l} }{ \hat y^{(i)}_{l} }\right) s_l\left(\delta_{lj}-s_j\right) \frac{\partial z_j^{(i)}}{\partial w_{km}} $$with
$$ \frac{\partial z_j^{(i)}}{\partial w_{km}} = x^{(i)}_k \delta_{mj}$$we get
$$ \frac{\partial J}{\partial w_{km}} = \sum\limits_{i=1}^{n_d} \sum\limits_{l=1}^{n_o} \left( -\frac{y^{(i)}_{l} }{ \hat y^{(i)}_{l} }\right) s_l\left(\delta_{lm}-s_m\right) x^{(i)}_{k} = \sum\limits_{i=1}^{n_d} \left( \hat y^{(i)}_{m} - y^{(i)}_{m} \right) x^{(i)}_{k} $$since $\hat y^{(i)}_{l} = s_l$ and $\sum_l y^{(i)}_{l} = 1$. We now look at a standard layer.
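Because $\hat y^{(i)}_l = s_l$ and $\sum_l y^{(i)}_l = 1$, the sum over $l$ collapses and the gradient reduces to $\sum_i (\hat y^{(i)}_m - y^{(i)}_m)\, x^{(i)}_k$, i.e. $X^\top(\hat Y - Y)$ in matrix form. A numpy sketch checking this against a finite-difference gradient (names and random data are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def loss(W, X, Y):
    """Cross-entropy of softmax(X @ W) against one-hot targets Y."""
    return -np.sum(Y * np.log(softmax(X @ W)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                 # n_d = 5 samples of the last hidden layer
Y = np.eye(3)[rng.integers(0, 3, size=5)]   # one-hot targets, n_o = 3 classes
W = rng.normal(size=(4, 3))                 # the coefficients w_km

# collapsed analytic gradient: dJ/dW = X^T (y_hat - Y)
grad = X.T @ (softmax(X @ W) - Y)

# central finite differences as an independent check
eps = 1e-6
num = np.zeros_like(W)
for k in range(W.shape[0]):
    for m in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, m] += eps
        Wm[k, m] -= eps
        num[k, m] = (loss(Wp, X, Y) - loss(Wm, X, Y)) / (2 * eps)
```

The two gradients agree to the accuracy of the finite-difference scheme.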

Now we can consider a normal layer with sigmoid activation with input $x$, parameters $w$ and output $y$. We denote with $z$ the linear combination of $x$ that enters the activation function (the sigmoid function).
$$ y_j = \phi(z_j) $$$$ z = W^T x $$We have $$ \frac{\partial y_j}{\partial z_{j}} = y_j(1-y_j)$$ and $$\frac{\partial y_j}{\partial w_{ij}} = \frac{\partial y_j}{\partial z_{j}} \frac{\partial z_j}{\partial w_{ij}} = \frac{\partial y_j}{\partial z_{j}} x_i = y_j(1-y_j) x_i$$
We can now calculate the derivative of the loss function with respect to the parameters of the connections:
$$ \frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_{j}} \frac{\partial y_j}{\partial w_{ij}} = \frac{\partial J}{\partial y_{j}} y_j(1-y_j) x_i $$
We see that the derivative depends on the gradient of the loss with respect to the outputs of the layer we are considering. This output is the input into the next layer, so it will be useful to be able to calculate the derivative of $J$ with respect to the layer input:
$$ \frac{\partial J}{\partial x_{i}} = \sum\limits_{j} \frac{\partial J}{\partial z_{j}} \frac{\partial z_j}{\partial x_{i}} = \sum\limits_{j} \frac{\partial J}{\partial z_{j}} w_{ij} $$with
$$ \frac{\partial J}{\partial z_{j}} = \frac{\partial J}{\partial y_{j}} \frac{\partial y_j}{\partial z_{j}} = \frac{\partial J}{\partial y_{j}} y_j(1-y_j) $$To adapt the parameters of the network we start with the last layer and do for each layer:
- calculate $\partial J / \partial w_{ij}$ from the gradient with respect to the layer output and the layer input,
- update the weights with a gradient step $w_{ij} \leftarrow w_{ij} - \eta \,\partial J / \partial w_{ij}$,
- calculate $\partial J / \partial x_i$ and pass it to the previous layer, where it plays the role of $\partial J / \partial y$.
The parameter adaptation thus moves through the network in the opposite direction to the prediction, which is why the procedure is called back-propagation.
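The per-layer forward and backward computations above can be collected into two small functions; a minimal numpy sketch (function and variable names are ours), using a squared-error loss $J = \tfrac12\|y-t\|^2$ for the example so that $\partial J/\partial y = y - t$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, x):
    """y = phi(W^T x) for one layer."""
    return sigmoid(W.T @ x)

def layer_backward(W, x, y, dJ_dy):
    """Given dJ/dy for the layer output, return (dJ/dW, dJ/dx)."""
    dJ_dz = dJ_dy * y * (1.0 - y)   # dy_j/dz_j = y_j (1 - y_j)
    dJ_dW = np.outer(x, dJ_dz)      # dJ/dw_ij = x_i * dJ/dz_j
    dJ_dx = W @ dJ_dz               # dJ/dx_i = sum_j w_ij * dJ/dz_j
    return dJ_dW, dJ_dx

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))   # 4 inputs, 3 units
x = rng.normal(size=4)
t = rng.uniform(size=3)       # target, for J = 0.5 * ||y - t||^2

y = layer_forward(W, x)
dJ_dW, dJ_dx = layer_backward(W, x, y, y - t)

# gradient step on this layer's weights
eta = 0.1
W_new = W - eta * dJ_dW
```

`dJ_dx` is what gets passed to the previous layer, where it serves as that layer's $\partial J/\partial y$.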
We need to break the symmetry between nodes at initialisation time, otherwise all nodes in a layer will receive the same updates and remain identical.
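To see why, one can initialise a small two-layer network with all weights identical and train it for a few steps; every hidden unit then receives the same update, and the columns of the first weight matrix stay equal to each other (a sketch with made-up data and a squared-error loss; all names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)    # one made-up input
t = np.array([1.0])       # its target

# symmetric initialisation: all weights identical (here zero)
W1 = np.zeros((4, 3))     # input -> 3 hidden units
W2 = np.zeros((3, 1))     # hidden -> 1 output

eta = 0.1
for _ in range(100):
    h = sigmoid(W1.T @ x)
    y = sigmoid(W2.T @ h)
    # back-propagation for the squared error 0.5 * (y - t)^2
    d2 = (y - t) * y * (1.0 - y)
    d1 = (W2 @ d2) * h * (1.0 - h)
    W2 -= eta * np.outer(h, d2)
    W1 -= eta * np.outer(x, d1)
```

After training, the three columns of `W1` are still exactly equal: the hidden units never differentiate, so the network behaves like one with a single hidden unit.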
Training a neural network is challenging because:
- the loss function is not convex in the parameters, so gradient descent can get stuck in poor local minima or saddle points,
- the gradients can vanish or explode as the error is propagated through many layers,
- the result is sensitive to hyperparameters such as the learning rate and the initialisation.
Neural networks have many parameters and can easily overfit. To prevent this there are several options:
- regularisation, e.g. adding an $L_2$ penalty on the weights to the loss,
- early stopping: monitoring the error on a validation set and stopping training when it starts to increase,
- dropout: randomly disabling units during training,
- using more training data, possibly via data augmentation.
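For example, $L_2$ regularisation adds $\frac{\lambda}{2}\sum_{i,j} w_{ij}^2$ to the loss, which simply adds $\lambda\, w_{ij}$ to each gradient before the update step (a sketch; $\lambda$, $\eta$ and the arrays are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
dJ_dW = rng.normal(size=(4, 3))  # stand-in for the back-propagated gradient
lam, eta = 0.01, 0.1

# the L2 penalty lam/2 * sum(W**2) contributes lam * W to the gradient,
# shrinking the weights towards zero at every step ("weight decay")
W_new = W - eta * (dJ_dW + lam * W)
```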