Backpropagation

This is the “engine” of learning. We just calculated the Loss ($L$). Now we need to know: “How much did each weight contribute to this error?”

If we know that increasing weight $w_{11}$ by a tiny bit increases the error, then we should decrease $w_{11}$. This “sensitivity” is called a Gradient.
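
To make this “sensitivity” concrete, here is a minimal sketch using a hypothetical one-weight model (not the network from this article) that estimates it by nudging the weight and watching the loss:

```python
# Estimate the "sensitivity" of the loss to one weight by nudging it slightly.
# Hypothetical one-weight model for illustration: prediction = w * x.
def loss(w, x=1.0, target=0.0):
    prediction = w * x
    return (prediction - target) ** 2

w = 0.5
eps = 1e-6
sensitivity = (loss(w + eps) - loss(w)) / eps  # ≈ dLoss/dw = 2 * w * x = 1.0

# Positive sensitivity: increasing w increases the error, so w should be decreased.
print(sensitivity)
```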

We calculate these gradients using the Chain Rule of calculus, propagating the error backward from the output to the input.

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
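
As a rough illustration on a toy composition (not the network itself), the chain rule lets us multiply the local derivatives along the path from the output back to the input:

```python
# Chain rule on a toy composition x -> y -> z.
# y = 3 * x and z = y ** 2, so dz/dx = dz/dy * dy/dx = (2 * y) * 3.
x = 2.0
y = 3.0 * x        # y = 6.0
z = y ** 2         # z = 36.0

dz_dy = 2.0 * y    # local derivative of z with respect to y
dy_dx = 3.0        # local derivative of y with respect to x
dz_dx = dz_dy * dy_dx

print(dz_dx)       # 36.0 -- same as differentiating z = (3x)^2 = 9x^2 directly: 18x = 36
```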

Let’s look at the simple network example that we saw above:

First, let’s calculate the loss for the output $a$:

$$
\begin{aligned}
L &= (a - \text{Target})^2 \\
  &= (\text{ReLU}(z) - 0)^2 \\
  &= \Big( \text{ReLU}\big( \text{sum}\big( \text{mul}(x_1, w_1), \text{mul}(x_2, w_2), \text{mul}(x_3, w_3), \text{mul}(x_4, w_4), b \big) \big) \Big)^2
\end{aligned}
$$

(Note: In this specific example, we set the Target to 0 to simplify the math. We want the neuron to learn to output 0.)
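
A minimal sketch of that forward pass in Python; the inputs $x_2$–$x_4$, the weights, and the bias below are placeholder values (only $x_1 = 1.0$ and the Target of 0 come from the walkthrough):

```python
# Forward pass for the single neuron, written as explicit mul / sum / ReLU / loss steps.
# x2..x4, the weights, and the bias are placeholder values for illustration.
x = [1.0, 2.0, 3.0, 4.0]       # x1..x4 (x1 = 1.0 matches the walkthrough)
w = [0.5, 0.25, 0.1, -0.05]    # w1..w4 (illustrative)
b = 0.05                       # bias (illustrative)
target = 0.0                   # we want the neuron to learn to output 0

muls = [xi * wi for xi, wi in zip(x, w)]  # mul(x_i, w_i) for each input
z = sum(muls) + b                         # sum(..., b)
a = max(0.0, z)                           # ReLU(z)
loss = (a - target) ** 2                  # squared error against the target

print(z, a, loss)                         # ≈ 1.15 1.15 1.3225
```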

To find the gradient of the Loss with respect to a specific weight (e.g., $w_{11}$, connecting Input 1 to Neuron 1), we use the chain rule. We trace the path from the Loss back to the weight:

Path: Loss → ReLU → Sum → Mul → $w_{11}$

$$\frac{\partial \text{Loss}}{\partial w_{11}} = \frac{\partial \text{Loss}}{\partial \text{ReLU}} \cdot \frac{\partial \text{ReLU}}{\partial \text{sum}} \cdot \frac{\partial \text{sum}}{\partial \text{mul}} \cdot \frac{\partial \text{mul}}{\partial w_{11}}$$

Using the values from our actual code execution (where $x_1 = 1.0$, $z \approx 0.964$, $a \approx 0.964$):

1. $\frac{\partial \text{Loss}}{\partial \text{ReLU}}$: the derivative of Mean Squared Error ($\frac{1}{n}\sum(a - y)^2$) with respect to $a$. Since we have $n = 3$ neurons:
   - Formula: $\frac{2}{3}(a - \text{Target})$
   - Result: $\frac{2}{3}(0.964 - 0) \approx 0.642$
2. $\frac{\partial \text{ReLU}}{\partial \text{sum}}$: since $z \approx 0.964 > 0$, the slope is $1$.
3. $\frac{\partial \text{sum}}{\partial \text{mul}}$: $1$ (the sum node passes the gradient through unchanged).
4. $\frac{\partial \text{mul}}{\partial w_{11}}$: the input $x_1 = 1.0$ (the derivative of $x_1 \cdot w_{11}$ with respect to $w_{11}$ is $x_1$).

Final Gradient for $w_{11}$:

$$\frac{\partial \text{Loss}}{\partial w_{11}} = 0.642 \cdot 1 \cdot 1 \cdot 1.0 = 0.642$$
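
The same arithmetic, written as a short sketch using the walkthrough values ($a \approx 0.964$, $x_1 = 1.0$, $n = 3$):

```python
# Reproduce the chain-rule arithmetic for dLoss/dw11 with the walkthrough values.
a = 0.964        # neuron output, ReLU(z)
z = 0.964        # pre-activation sum
x1 = 1.0         # input connected through w11
target = 0.0
n = 3            # number of neurons in the MSE average

dloss_da  = (2.0 / n) * (a - target)   # dLoss/dReLU ≈ 0.642
da_dz     = 1.0 if z > 0 else 0.0      # dReLU/dsum (slope of ReLU)
dsum_dmul = 1.0                        # dsum/dmul
dmul_dw11 = x1                         # dmul/dw11

grad_w11 = dloss_da * da_dz * dsum_dmul * dmul_dw11
print(grad_w11)                        # ≈ 0.642, matching the walkthrough
```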

This positive gradient tells us that increasing $w_{11}$ will increase the error, so we should decrease it.
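
That decrease is exactly what a gradient descent update does; here is a minimal sketch, where the learning rate and the current value of $w_{11}$ are assumed for illustration:

```python
# Gradient descent step: move w11 a small step against its gradient.
learning_rate = 0.1   # assumed value for illustration
w11 = 0.4             # hypothetical current value of the weight
grad_w11 = 0.642      # the gradient computed above

w11 = w11 - learning_rate * grad_w11  # the weight decreases, which lowers the loss
print(w11)                            # ≈ 0.3358
```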
