Gradient Descent for a Single Artificial Neuron

This article explores the gradient descent algorithm for a single artificial neuron. The activation function is set to the logistic function.

Logistic Function

The logistic function is the most common kind of sigmoid function. It is defined as follows

    \begin{equation*} \sigma(u) = \frac{1}{1 + e^{-u}} \end{equation*}
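
As a quick concreteness check, here is a minimal Python sketch of the logistic function. The name sigmoid and the sample inputs are my own illustrative choices, not part of the derivation.

    import math

    def sigmoid(u: float) -> float:
        """Logistic function: sigma(u) = 1 / (1 + exp(-u))."""
        # Note: for large negative u, exp(-u) can overflow; a numerically
        # stable variant would branch on the sign of u.
        return 1.0 / (1.0 + math.exp(-u))

    print(sigmoid(0.0))   # 0.5
    print(sigmoid(4.0))   # ~0.982
    print(sigmoid(-4.0))  # ~0.018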

Properties

The logistic function has two especially useful properties:

  1. The output \sigma(u) is always between 0 and 1.
  2. Unlike the unit step function, \sigma(u) is smooth and differentiable, which makes the derivation of the update equations straightforward.

\sigma(-u) = 1 - \sigma(u)

    \begin{align*} 	1 - \sigma(u) &= 1 - \frac{1}{1 + e^{-u}} \\ &= \frac{e^{-u}}{1 + e^{-u}} \\ &= \frac{1}{\frac{1}{e^{-u}} + 1} \\ &= \frac{1}{1 + e^{u}} \\ &= \sigma(-u) \qed \\ \end{align*}

\frac{d\sigma(u)}{du} = \sigma(u)\sigma(-u)

    \begin{align*} 	\frac{d\sigma(u)}{du} &= \frac{d}{du} \Big(\frac{1}{1 + e^{-u}}\Big) \\ &= \frac{d}{du} \Big((1 + e^{-u})^{-1}\Big) \\ &= -(1 + e^{-u})^{-2} \frac{d}{du} (1 + e^{-u}) \\ &= -(1 + e^{-u})^{-2}e^{-u} \frac{d}{du} (-u) \\ &= (1 + e^{-u})^{-2}e^{-u} \\ &= \frac{1}{1 + e^{-u}}\frac{e^{-u}}{1 + e^{-u}} \\ &= \frac{1}{1 + e^{-u}}\frac{1}{e^{u} + 1} \\ &= \sigma(u)\sigma(-u) \qed \end{align*}
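
Both identities are easy to verify numerically. The following sketch (my own check, not part of the original derivation) compares \sigma(-u) with 1 - \sigma(u) directly, and compares the analytic derivative \sigma(u)\sigma(-u) with a central finite difference.

    import math

    def sigmoid(u: float) -> float:
        return 1.0 / (1.0 + math.exp(-u))

    h = 1e-6
    for u in (-2.0, -0.5, 0.0, 1.0, 3.0):
        # sigma(-u) = 1 - sigma(u)
        assert abs(sigmoid(-u) - (1.0 - sigmoid(u))) < 1e-12
        # d sigma/du = sigma(u) * sigma(-u), checked against a central difference
        numeric = (sigmoid(u + h) - sigmoid(u - h)) / (2.0 * h)
        analytic = sigmoid(u) * sigmoid(-u)
        assert abs(numeric - analytic) < 1e-8
    print("both identities hold numerically")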

Single Artificial Neuron

Notations

  • {x_1, \cdots ,x_K} are input values
  • {w_1, \cdots ,w_K} are weights
  • y is a scalar output
  • f is the activation function (also called decision/transfer function)
  • t is the label (gold standard)
  • \eta is the learning rate (\eta > 0)

The unit works in the following way

    \begin{equation*}      y = f(u) \end{equation*}

where u is a scalar number, which is the net input of the neuron.

    \begin{equation*}      u = \sum_{i=1}^{K}w_ix_i \end{equation*}

In vector notation

    \begin{equation*}      u = \boldsymbol{w}^T\boldsymbol{x} \end{equation*}

To include a bias term, simply add an input dimension (e.g., x_0) that is constant 1.
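
Putting the forward pass into code: the sketch below computes u = \boldsymbol{w}^T\boldsymbol{x} with the bias folded in as x_0 = 1, then applies the logistic activation. The function name forward and the weight/input values are illustrative assumptions, not from the article.

    import math

    def sigmoid(u: float) -> float:
        return 1.0 / (1.0 + math.exp(-u))

    def forward(w, x):
        """Net input u = w^T x, followed by y = sigma(u)."""
        u = sum(wi * xi for wi, xi in zip(w, x))
        return sigmoid(u)

    # x[0] = 1 acts as the bias input, so w[0] is the bias weight.
    x = [1.0, 0.5, -1.2]   # [x_0 (bias), x_1, x_2]
    w = [0.1, 0.4, -0.3]   # [w_0, w_1, w_2]
    print(forward(w, x))   # sigma(0.1 + 0.2 + 0.36) = sigma(0.66) ≈ 0.659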

The error function (the training objective) is defined as

    \begin{equation*}      E = \frac{1}{2}(t - y)^2 \end{equation*}

We use stochastic gradient descent as the learning algorithm for this model. First, take the derivative of E with respect to w_i:

    \begin{align*}  \frac{\partial E}{\partial w_i} &= \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial u}\cdot\frac{\partial u}{\partial w_i} \\  \frac{\partial E}{\partial y} &= \frac{\partial}{\partial y}\Big(\frac{1}{2}(t - y)^2\Big) = (y - t) \\  \frac{\partial y}{\partial u} &= \frac{\partial}{\partial u}\sigma(u) = \sigma(u)\sigma(-u) = \sigma(u)(1 - \sigma(u)) = y(1 - y) \\  \frac{\partial u}{\partial w_i} &= \frac{\partial}{\partial w_i}\Big(\sum_{j=1}^{K}w_jx_j\Big) = x_i \end{align*}

Thus,

    \begin{equation*}      \frac{\partial E}{\partial w_i} = (y-t) \cdot y(1-y) \cdot x_i \end{equation*}
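
A quick sanity check of this formula (my own verification sketch, not from the article): compare the analytic gradient (y - t) \cdot y(1 - y) \cdot x_i with a central finite difference of E for each weight.

    import math

    def sigmoid(u):
        return 1.0 / (1.0 + math.exp(-u))

    def error(w, x, t):
        """E = 0.5 * (t - y)^2 with y = sigma(w^T x)."""
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        return 0.5 * (t - y) ** 2

    w, x, t = [0.1, 0.4, -0.3], [1.0, 0.5, -1.2], 1.0
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

    h = 1e-6
    for i in range(len(w)):
        analytic = (y - t) * y * (1.0 - y) * x[i]
        w_plus = [wj + (h if j == i else 0.0) for j, wj in enumerate(w)]
        w_minus = [wj - (h if j == i else 0.0) for j, wj in enumerate(w)]
        numeric = (error(w_plus, x, t) - error(w_minus, x, t)) / (2.0 * h)
        assert abs(analytic - numeric) < 1e-8
    print("analytic gradient matches finite differences")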

Once the derivative is computed, apply the stochastic gradient descent update to each training sample:

    \begin{equation*}      \boldsymbol{w}^{(new)} =\boldsymbol{w}^{(old)} - \eta \cdot (y - t) \cdot y(1 - y) \cdot \boldsymbol{x}  \end{equation*}
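
Putting everything together, here is a minimal per-sample (stochastic) training loop for the single neuron. The toy AND-style dataset, the learning rate \eta = 0.5, and the number of epochs are illustrative assumptions, not prescribed by the derivation.

    import math

    def sigmoid(u):
        return 1.0 / (1.0 + math.exp(-u))

    # Toy dataset: logical AND, with x[0] = 1 as the bias input.
    data = [
        ([1.0, 0.0, 0.0], 0.0),
        ([1.0, 0.0, 1.0], 0.0),
        ([1.0, 1.0, 0.0], 0.0),
        ([1.0, 1.0, 1.0], 1.0),
    ]

    w = [0.0, 0.0, 0.0]   # [w_0 (bias), w_1, w_2]
    eta = 0.5             # learning rate

    for epoch in range(10000):
        for x, t in data:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # w_new = w_old - eta * (y - t) * y * (1 - y) * x
            w = [wi - eta * (y - t) * y * (1.0 - y) * xi for wi, xi in zip(w, x)]

    for x, t in data:
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        print(f"target {t:.0f}  prediction {y:.3f}")

After training, the predictions should move toward the targets (close to 0 for the first three samples and close to 1 for the last), since AND is linearly separable.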

