The problems we had with the perceptron were:

We need an algorithm that takes a more balanced approach:
In logistic regression we model the log-odds of the probability $p$ as a linear function of the features,
$$ \log\frac{p}{1-p} = z = w_0 + \sum\limits_{j=1}^{n_f} x_j w_j $$
If we set $x_0 = 1$ we can write this as
$$ z = \sum\limits_{j=0}^{n_f} x_j w_j = (x_0, x_1, \ldots, x_{n_f}) \cdot (w_0, w_1, \ldots, w_{n_f}) $$
We can invert to obtain the probability as a function of $z$:
$$ p = \frac{1}{1+e^{-z}} \equiv \phi(z) $$
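As a minimal illustration (the weights and feature values here are hypothetical, and `sigmoid` is our own helper name), $z$ and $\phi(z)$ can be computed with NumPy:

```python
import numpy as np

def sigmoid(z):
    """The logistic function phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and one feature vector; x[0] = 1 absorbs the bias w_0.
w = np.array([-0.5, 1.2, 0.8])   # (w_0, w_1, w_2)
x = np.array([1.0, 0.3, -1.1])   # (x_0 = 1, x_1, x_2)

z = np.dot(x, w)                 # z = sum_j x_j w_j
p = sigmoid(z)                   # estimated probability of the positive class
print(f"z = {z:.3f}, p = {p:.3f}")
```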


To optimize the parameters $w$ of the logistic regression model we calculate the gradient of the loss function $J$ and take a step in the opposite direction:
$$ w_{j} \rightarrow w_j - \eta \frac{\partial J}{\partial w_j} $$
Here $\eta$ is the learning rate; it sets the speed at which the parameters are adapted.
It has to be set empirically, and finding a suitable $\eta$ is not always easy.

A learning rate that is too small leads to slow convergence.

A learning rate that is too large might spoil convergence altogether!
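The effect is easy to see on a toy one-dimensional problem. The quadratic loss $J(w) = w^2$ below is only a stand-in chosen for illustration (it is not the logistic regression loss), and the values of $\eta$ are hypothetical:

```python
def gradient_descent(eta, w0=5.0, steps=20):
    """Minimize J(w) = w**2 (gradient dJ/dw = 2*w) with a fixed learning rate eta."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w   # the update w -> w - eta * dJ/dw
    return w

print(gradient_descent(eta=0.01))  # too small: after 20 steps still far from the minimum at w = 0
print(gradient_descent(eta=0.5))   # well chosen: reaches the minimum immediately
print(gradient_descent(eta=1.1))   # too large: the iterates oscillate and diverge
```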

where $i$ runs over all data samples, $1 \leq i \leq n_d$, and $j$ runs from $0$ to $n_f$.
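For concreteness, if $J$ is the cross-entropy loss (its explicit form is not written out in this section, so this is an assumption), the gradient components take the form
$$ \frac{\partial J}{\partial w_j} = \sum\limits_{i=1}^{n_d} \left( \phi(z^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad z^{(i)} = \sum\limits_{j=0}^{n_f} w_j x_j^{(i)} $$
where $y^{(i)}$ is the label of sample $i$.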
Algorithms:
- Newton's method: a second-order method

Training:
- use all the training data to compute the gradient
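Putting the pieces together, here is a minimal sketch of such a batch fit, where every update uses all training samples. It assumes the cross-entropy gradient from above, and the data and names (`fit_logistic`, `sigmoid`) are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_epochs=1000):
    """Batch gradient descent for logistic regression.

    X: (n_d, n_f) feature matrix, y: (n_d,) labels in {0, 1}.
    Returns weights of shape (n_f + 1,), with w[0] the bias w_0.
    """
    n_d, n_f = X.shape
    Xb = np.hstack([np.ones((n_d, 1)), X])   # prepend x_0 = 1 to absorb the bias
    w = np.zeros(n_f + 1)
    for _ in range(n_epochs):
        p = sigmoid(Xb @ w)                  # predicted probabilities phi(z)
        grad = Xb.T @ (p - y) / n_d          # averaged cross-entropy gradient
        w -= eta * grad                      # gradient-descent update
    return w

# Tiny synthetic data set with two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
print("fitted weights:", fit_logistic(X, y))
```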
two features relevant to the discrimination of benign and malignant tumors:

The data is not linearly separable.
We can train a sigmoid model to discriminate between the two types of tumors. It will assign the output class according to the value of
$$ z = w_0 + \sum_j w_j x_j = w_0 + (x_1, x_2)\,(w_1, w_2)^T $$
where $w_0$, $w_1$ and $w_2$ are chosen to minimize the loss.

The decision boundary is linear because $z$ is a linear function of the features.
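To make this explicit: the boundary between the two predicted classes lies where the model is undecided, $p = \phi(z) = 1/2$, i.e. $z = 0$,
$$ w_0 + w_1 x_1 + w_2 x_2 = 0 \quad\Longleftrightarrow\quad x_2 = -\frac{w_0 + w_1 x_1}{w_2} $$
which (for $w_2 \neq 0$) is a straight line in the $(x_1, x_2)$ plane.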
Logistic regression gives us an estimate of the probability that an example belongs to the first class.

It is often important to normalise the features. We want the argument of the sigmoid function to be of order one.
It is useful to normalise features such that their mean is 0 and their variance is 1.
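A minimal sketch of this standardisation (the array values are hypothetical; the commented scikit-learn call is one common alternative):

```python
import numpy as np

# Standardise each feature to zero mean and unit variance.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0), X_norm.std(axis=0))   # approximately 0 and exactly 1 per feature

# Equivalently, with scikit-learn:
# from sklearn.preprocessing import StandardScaler
# X_norm = StandardScaler().fit_transform(X)
```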