~120s Visual Explainer

Backpropagation

How neural networks learn by propagating errors backward through layers.

[Animation: a 2-2-1 network with four weights (w₁ to w₄) runs a forward pass on inputs x₁ = 0.5, x₂ = 0.8, producing hidden values h₁ = 0.6, h₂ = 0.4 and a prediction ŷ = 0.7. Against the target y = 1.0, the loss is L = (y - ŷ)² = (1.0 - 0.7)² = 0.09. The gradient ∂L/∂ŷ = -0.6 then flows backward through the chain rule, ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂w, giving a gradient for every weight. The update w_new = w_old - α × ∂L/∂w with α = 0.1 improves the prediction to ŷ = 0.85 and drops the loss to 0.02, about 78% lower.]

A Simple Neural Network

Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.

Our goal: adjust these weights so the network makes accurate predictions.

  • Neurons (nodes) compute a weighted sum of their inputs, then apply an activation function
  • Weights are the learnable parameters
  • This network has 4 weights to learn
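For readers who like to see the wiring as code, here is a minimal sketch of the same network in Python. The concrete weight values and the one-input-per-hidden-neuron wiring are assumptions (the explainer only specifies the shape: 2 inputs, 2 hidden neurons, 1 output, 4 weights), chosen so the numbers line up with the animation.

```python
# The 2-2-1 network from the diagram, with its 4 learnable weights:
#
#   x1 --w1--> h1 --w3--\
#                        >--> y_hat
#   x2 --w2--> h2 --w4--/
#
# Only the weights are learned; the inputs come from the data.
inputs = {"x1": 0.5, "x2": 0.8}                         # example inputs from the animation
weights = {"w1": 1.2, "w2": 0.5, "w3": 0.5, "w4": 1.0}  # assumed starting values
```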

Forward Pass: Computing the Output

Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.

This is a forward pass — inputs in, prediction out.

  • Each neuron computes: output = activation(Σ weights × inputs)
  • Forward pass is just matrix multiplication + activation
  • The final output is the network's prediction
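A sketch of that forward pass with concrete numbers. The identity activation and the specific weights are assumptions made so the intermediate values match the animation (h₁ = 0.6, h₂ = 0.4, ŷ = 0.7); a real network would use a nonlinearity such as ReLU or sigmoid.

```python
x1, x2 = 0.5, 0.8                    # inputs
w1, w2, w3, w4 = 1.2, 0.5, 0.5, 1.0  # assumed weights

def activation(z):
    # Identity activation keeps the arithmetic transparent;
    # swap in ReLU, sigmoid, or tanh for a real network.
    return z

# Hidden layer: in this tiny example each input feeds one hidden neuron.
h1 = activation(w1 * x1)             # 0.6
h2 = activation(w2 * x2)             # 0.4

# Output layer combines the hidden activations into the prediction.
y_hat = activation(w3 * h1 + w4 * h2)
print(y_hat)                         # 0.7
```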

How Wrong Are We?

We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.

The larger the loss, the worse the prediction. Our job: minimize this loss.

  • Loss measures prediction error
  • Common losses: MSE, cross-entropy
  • Training = minimizing loss
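The squared-error loss from the example, as a quick check:

```python
y, y_hat = 1.0, 0.7            # target and prediction

loss = (y - y_hat) ** 2        # squared error
print(loss)                    # ~0.09: small, but room to improve
```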

Gradients Flow Backward

Now the magic: we ask "how does each weight contribute to the loss?" The answer is the gradient — the derivative of loss with respect to each weight.

We start at the output and work backward. The gradient ∂L/∂ŷ tells us how changes in the output affect the loss.

  • Gradient = direction of steepest loss increase
  • Negative gradient = direction to reduce loss
  • Computed via calculus (chain rule)
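Since L = (y - ŷ)², the output gradient is ∂L/∂ŷ = -2(y - ŷ). Plugging in the example numbers reproduces the -0.6 shown in the animation:

```python
y, y_hat = 1.0, 0.7

# L = (y - y_hat)**2  =>  dL/dy_hat = -2 * (y - y_hat)
dL_dyhat = -2 * (y - y_hat)
print(dL_dyhat)   # ~-0.6: nudging y_hat upward would reduce the loss
```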

Chain Rule Through Layers

The chain rule lets us decompose the gradient through each layer. The gradient for w₃ combines the gradient from the output with the gradient through the activation.

At the hidden layer, gradients split — each hidden neuron receives gradients from all connections leading forward.

  • Chain rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂w
  • Gradients accumulate through layers
  • This is why it's called "backpropagation"
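Here is the same chain-rule bookkeeping written out for all four weights, under the earlier assumptions (identity activation, illustrative weights). With an identity activation the ∂ŷ/∂h and ∂h/∂w factors reduce to plain products; with a nonlinearity you would also multiply by the activation's derivative at each step.

```python
x1, x2, y = 0.5, 0.8, 1.0
w1, w2, w3, w4 = 1.2, 0.5, 0.5, 1.0   # assumed weights

# Forward pass (identity activation).
h1, h2 = w1 * x1, w2 * x2
y_hat = w3 * h1 + w4 * h2

# Backward pass: start at the loss and apply the chain rule layer by layer.
dL_dyhat = -2 * (y - y_hat)           # dL/dy_hat

dL_dw3 = dL_dyhat * h1                # dy_hat/dw3 = h1
dL_dw4 = dL_dyhat * h2                # dy_hat/dw4 = h2

dL_dh1 = dL_dyhat * w3                # gradient flowing back into h1
dL_dh2 = dL_dyhat * w4                # gradient flowing back into h2

dL_dw1 = dL_dh1 * x1                  # dh1/dw1 = x1
dL_dw2 = dL_dh2 * x2                  # dh2/dw2 = x2

print(dL_dw1, dL_dw2, dL_dw3, dL_dw4)
```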

Learning: Adjusting Weights

Finally, we update each weight by stepping in the opposite direction of its gradient: if increasing a weight would increase the loss (a positive gradient), we decrease that weight, and vice versa. The learning rate (α) controls the step size.

After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.

  • Update rule: w = w - α × gradient
  • Learning rate is a hyperparameter
  • One full pass over the training data = an epoch
  • This is gradient descent
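Putting it together: the update rule inside a small training loop. This is a sketch under the same assumptions (identity activation, illustrative starting weights), so the exact numbers differ from the animation's 0.85 / 0.02, but the behavior is the same: the prediction climbs toward the target and the loss falls.

```python
x1, x2, y = 0.5, 0.8, 1.0
w1, w2, w3, w4 = 1.2, 0.5, 0.5, 1.0   # assumed starting weights
alpha = 0.1                            # learning rate

for step in range(5):
    # Forward pass (identity activation).
    h1, h2 = w1 * x1, w2 * x2
    y_hat = w3 * h1 + w4 * h2
    loss = (y - y_hat) ** 2

    # Backward pass (chain rule).
    dL_dyhat = -2 * (y - y_hat)
    dL_dw3, dL_dw4 = dL_dyhat * h1, dL_dyhat * h2
    dL_dw1, dL_dw2 = dL_dyhat * w3 * x1, dL_dyhat * w4 * x2

    # Gradient descent: step each weight against its gradient.
    w1 -= alpha * dL_dw1
    w2 -= alpha * dL_dw2
    w3 -= alpha * dL_dw3
    w4 -= alpha * dL_dw4

    print(f"step {step}: y_hat = {y_hat:.3f}, loss = {loss:.4f}")
```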

What's Next?

Backpropagation is the foundation of all modern deep learning. Next, explore optimization algorithms like Adam and SGD that make training faster and more stable.