Reference for cross-entropy, MSE, perplexity, and contrastive loss in training and evaluation
Asked in roughly 70% of ML interviews
Used in every training pipeline
Key decision: choosing the right loss for the task
TL;DR
Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.
Visual Overview
WHAT LOSS FUNCTIONS DO
+-----------------------------------------------------------+
| |
| Input --> Model --> Prediction |
| | |
| v compare |
| Target |
| | |
| v |
| Loss <-- single number |
| | (lower = better) |
| v |
| Gradient <-- direction to improve |
| |
+-----------------------------------------------------------+
Loss = how wrong the model is. Training = minimize loss.
Cross-Entropy Loss
Use for: Classification (the standard choice)
CROSS-ENTROPY FORMULA
+-----------------------------------------------------------+
| |
| CE = -SUM(y_true x log(y_pred)) |
| |
| For binary: |
| CE = -[y x log(p) + (1-y) x log(1-p)] |
| |
| Example (binary, true label = 1): |
| Model predicts 0.9 -> Loss = -log(0.9) = 0.105 (low) |
| Model predicts 0.1 -> Loss = -log(0.1) = 2.303 (high) |
| |
+-----------------------------------------------------------+
LOSS BY PREDICTION
+-----------------------------------------------------------+
| |
| Prediction True=1 Loss True=0 Loss |
| ---------------------------------------- |
| 0.99 0.01 4.61 |
| 0.90 0.11 2.30 |
| 0.50 0.69 0.69 |
| 0.10 2.30 0.11 |
| 0.01 4.61 0.01 |
| |
+-----------------------------------------------------------+
Intuition: Punishes confident wrong predictions severely.
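A minimal sketch of the binary formula in plain Python (the epsilon clamp is a standard safeguard, not part of the formula); it reproduces the table above:

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """y: true label (0 or 1); p: predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)   # clamp so log() never sees 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Reproduces the table: confident-and-wrong predictions are punished hardest.
for p in [0.99, 0.90, 0.50, 0.10, 0.01]:
    print(f"p={p:.2f}  true=1: {binary_cross_entropy(p, 1):.2f}  "
          f"true=0: {binary_cross_entropy(p, 0):.2f}")
```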
Why Cross-Entropy (Not MSE) for Classification?
Reason 1: It’s Maximum Likelihood
MAXIMUM LIKELIHOOD
+-----------------------------------------------------------+
| |
| Goal: Find model that maximizes P(data | parameters) |
| |
| For classification: |
| P(correct labels) = PRODUCT(P(true_class_i)) |
| |
| Log likelihood (easier to work with): |
| log P = SUM(log P(true_class_i)) |
| |
| Minimizing negative log likelihood: |
| -log P = -SUM(log P(true_class_i)) |
| |
| This IS cross-entropy loss. |
| |
+-----------------------------------------------------------+
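To make the derivation concrete, a tiny numeric check with made-up probabilities, showing that the negative log-likelihood of a batch is exactly the summed per-example cross-entropy:

```python
import math

p_true = [0.9, 0.7, 0.4]             # model's probability for each example's true class

likelihood = math.prod(p_true)        # P(correct labels) = PRODUCT(P(true_class_i))
nll = -math.log(likelihood)           # negative log likelihood
ce_sum = sum(-math.log(p) for p in p_true)   # summed per-example cross-entropy

print(nll, ce_sum)                    # both ~1.378 -- the same quantity
```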
Reason 2: Better Gradients
MSE VS CROSS-ENTROPY GRADIENTS
+-----------------------------------------------------------+
| |
| True label: 1, Prediction: 0.01 (confident and WRONG) |
| |
| MSE gradient: |
| d/dp (p - 1)^2 = 2(p - 1) = 2(0.01 - 1) = -1.98 |
| |
| Cross-entropy gradient: |
| d/dp -log(p) = -1/p = -1/0.01 = -100 |
| |
| Cross-entropy: 50x larger gradient when confidently |
| wrong! Model learns faster from its worst mistakes. |
| |
+-----------------------------------------------------------+
The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
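A quick finite-difference check of the two gradients above, in plain Python with the same numbers:

```python
import math

def grad(f, p, h=1e-6):
    """Central-difference estimate of df/dp."""
    return (f(p + h) - f(p - h)) / (2 * h)

mse = lambda p: (p - 1) ** 2     # MSE loss, true label = 1
ce  = lambda p: -math.log(p)     # cross-entropy loss, true label = 1

p = 0.01                         # confident and wrong
print(grad(mse, p))              # ~ -1.98
print(grad(ce, p))               # ~ -100.0  (about 50x larger)
```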
Perplexity
Use for: Evaluating language models
PERPLEXITY
+-----------------------------------------------------------+
| |
| Perplexity = exp(cross-entropy) |
| |
| Or equivalently: |
| PPL = exp((-1/N) x SUM(log P(token_i))) |
| |
| Where N = number of tokens |
| |
+-----------------------------------------------------------+
PERPLEXITY EXAMPLES
+-----------------------------------------------------------+
| |
| PPL = 1 Model is certain (perfect prediction) |
| PPL = 10 Model is choosing among ~10 equally likely tokens |
| PPL = 100 Model is very uncertain |
| |
| Typical values: |
| GPT-2 on WikiText-103: ~20-30 PPL |
| GPT-3 175B: ~10-15 PPL |
| Fine-tuned on domain data: often < 10 PPL |
| |
+-----------------------------------------------------------+
Interpretation: “How many choices is the model confused between?”
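A minimal sketch of the formula, using made-up per-token log-probabilities:

```python
import math

token_log_probs = [-1.2, -0.3, -2.0, -0.7]   # log P(token_i) from the model

avg_nll = -sum(token_log_probs) / len(token_log_probs)   # per-token cross-entropy
ppl = math.exp(avg_nll)                                   # PPL = exp(cross-entropy)

print(ppl)   # ~2.86 -- "confused between ~3 choices" per token
```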
Mean Squared Error (MSE)
Use for: Regression (predicting continuous values)
MSE
+-----------------------------------------------------------+
| |
| MSE = (1/n) x SUM((y_true - y_pred)^2) |
| |
| Example: |
| True: [3, 5, 7] |
| Pred: [2.5, 5.2, 6.8] |
| |
| MSE = [(0.5)^2 + (0.2)^2 + (0.2)^2] / 3 |
| = [0.25 + 0.04 + 0.04] / 3 |
| = 0.11 |
| |
+-----------------------------------------------------------+
Intuition: Squared term punishes large errors more than small ones.
Variant — MAE (Mean Absolute Error):
- More robust to outliers than MSE
- Use when you have outliers that shouldn't dominate training (both metrics are sketched below)
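A plain-Python sketch of MSE and MAE on the example above:

```python
y_true = [3, 5, 7]
y_pred = [2.5, 5.2, 6.8]

errors = [t - p for t, p in zip(y_true, y_pred)]
mse = sum(e ** 2 for e in errors) / len(errors)   # squaring punishes large errors more
mae = sum(abs(e) for e in errors) / len(errors)   # linear -- outliers weigh less

print(round(mse, 2), round(mae, 2))   # 0.11  0.3
```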
Contrastive Loss
Use for: Embedding models, similarity learning
CONTRASTIVE LOSS
+-----------------------------------------------------------+
| |
| For positive pair (should be similar): |
| L = distance(a, b)^2 |
| |
| For negative pair (should be different): |
| L = max(0, margin - distance(a, b))^2 |
| |
| +--------------------------------------+ |
| | | |
| | anchor *-----* positive | |
| | ^ | |
| | minimize distance | |
| | | |
| | anchor * | |
| | v | |
| | maximize distance (up to | |
| | margin) | |
| | v | |
| | * negative | |
| | | |
| +--------------------------------------+ |
| |
+-----------------------------------------------------------+
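A minimal sketch of the pairwise contrastive loss above, using Euclidean distance on toy 2-D embeddings; the margin of 1.0 and the example vectors are illustrative:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, is_positive, margin=1.0):
    d = euclidean(a, b)
    if is_positive:
        return d ** 2                      # pull similar pairs together
    return max(0.0, margin - d) ** 2       # push dissimilar pairs apart, up to the margin

anchor   = [0.1, 0.9]
positive = [0.2, 0.8]    # should end up close to the anchor
negative = [0.9, 0.1]    # should end up at least `margin` away

print(contrastive_loss(anchor, positive, True))    # ~0.02 -- small, already close
print(contrastive_loss(anchor, negative, False))   # 0.0   -- already beyond the margin
```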
Variations:
- Triplet loss: anchor, positive, negative together
- InfoNCE: used in CLIP, SimCLR — treat other batch items as negatives
- Multiple negatives ranking: efficient batch training for embeddings
Choosing Loss Functions
| Task | Loss | Why |
|---|---|---|
| Binary classification | Binary Cross-Entropy | Standard, good gradients |
| Multi-class classification | Categorical Cross-Entropy | Generalizes BCE to N classes |
| Regression | MSE or MAE | MSE for normal errors, MAE for outliers |
| Embedding/similarity | Contrastive/Triplet | Learns relative distances |
| Language modeling | Cross-Entropy (per token) | Predict next token distribution |
| Ranking | Pairwise/Listwise losses | Optimize ordering |
Debugging Loss
LOSS NOT DECREASING
+-----------------------------------------------------------+
| |
| Symptoms: |
| - Loss stays flat from the start |
| - Loss decreases then plateaus early |
| |
| Causes: |
| - Learning rate too low -> increase 10x |
| - Data issue (bad labels, wrong preprocessing) |
| - Wrong loss function for task |
| - Model too small for task |
| - Bug in data pipeline (same batch repeated) |
| |
| Debug steps: |
| 1. Overfit to single batch first (should reach ~0) |
| 2. Check a few examples manually |
| 3. Verify labels are correct |
| 4. Try larger learning rate |
| |
+-----------------------------------------------------------+
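A hedged PyTorch sketch of debug step 1: train repeatedly on one fixed batch. If the loss does not approach zero, suspect the data, labels, or loss setup rather than model capacity. The toy model, data, and hyperparameters below are placeholders for your own.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 10)                 # one fixed batch
y = torch.randint(0, 3, (16,))

for step in range(500):                 # train on the SAME batch repeatedly
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())   # should be near 0; if not, check data, labels, and the loss setup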
LOSS GOES TO NaN
+-----------------------------------------------------------+
| |
| Symptoms: |
| - Loss suddenly becomes NaN or Inf |
| - Gradients explode |
| |
| Causes: |
| - Learning rate too high |
| - Numerical instability (log(0), division by 0) |
| - Missing gradient clipping |
| - Bad initialization |
| |
| Debug steps: |
| 1. Reduce learning rate by 10x |
| 2. Add gradient clipping (max_grad_norm=1.0) |
| 3. Check for log(0) -- add epsilon |
| 4. Use mixed precision carefully |
| |
+-----------------------------------------------------------+
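A sketch of debug steps 2 and 3 in PyTorch: clip_grad_norm_ caps exploding gradients, and a clamp before log() avoids log(0). The toy model and epsilon value are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
logits = model(x)

# Step 2: clip gradients so one bad batch can't blow up the weights.
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Step 3: if you take log() yourself, clamp first so log(0) never happens.
eps = 1e-8
probs = torch.softmax(logits.detach(), dim=-1)
manual_ce = -torch.log(probs.clamp(min=eps))[torch.arange(len(y)), y].mean()
print(manual_ce)
```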
Common Gotchas
1. Label smoothing
Instead of: [0, 0, 1, 0]
Use: [0.025, 0.025, 0.925, 0.025]
Prevents overconfidence, improves generalization.
Typical smoothing: 0.1 (10% of the probability mass spread uniformly across the classes, as above)
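A sketch using PyTorch's built-in label_smoothing argument (available on CrossEntropyLoss since 1.10), plus the manual smoothed target from the example; the logits and targets are made up:

```python
import torch
import torch.nn as nn

# Built-in: spreads `label_smoothing` of the probability mass uniformly over all classes.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
logits = torch.randn(2, 4)
targets = torch.tensor([2, 0])
print(loss_fn(logits, targets))

# Manual version of the smoothed target above (0.1 spread over 4 classes).
eps, num_classes = 0.1, 4
hard = torch.tensor([0., 0., 1., 0.])
smooth = hard * (1 - eps) + eps / num_classes
print(smooth)   # tensor([0.0250, 0.0250, 0.9250, 0.0250])
```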
2. Class imbalance
Weight rare classes higher:
CE_weighted = -SUM(weight_class x y x log(p))
Or use focal loss:
FL = -(1-p)^gamma x log(p)
Focuses on hard examples, down-weights easy ones.
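A sketch of both options in PyTorch: weighted cross-entropy via the weight argument of CrossEntropyLoss, and a minimal focal loss following the formula above. The class weights, gamma, and toy data are illustrative.

```python
import torch
import torch.nn as nn

# Weighted CE: make the rare class (index 1 here) count 5x more.
weighted_ce = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 5.0]))

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
print(weighted_ce(logits, targets))

# Minimal focal loss: down-weights easy examples, where p(true class) is already high.
def focal_loss(logits, targets, gamma=2.0):
    p = torch.softmax(logits, dim=-1)[torch.arange(len(targets)), targets]
    return (-(1 - p) ** gamma * torch.log(p)).mean()

print(focal_loss(logits, targets))
```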
When This Matters
| Situation | What to know |
|---|---|
| Training a classifier | Use cross-entropy, not MSE |
| Evaluating LLM quality | Report perplexity |
| Loss stuck high | Debug: overfit single batch first |
| Train/val loss diverging | Overfitting — add regularization |
| Loss goes NaN | Reduce LR, add gradient clipping |
| Imbalanced classes | Use weighted loss or focal loss |
| Fine-tuning embeddings | Contrastive loss variants |