Reference for cross-entropy, MSE, perplexity, and contrastive loss in training and evaluation
Asked in roughly 70% of ML interviews
Used in every training pipeline
Key decision: choosing the right loss for the task
TL;DR
Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.
Visual Overview
WHAT LOSS FUNCTIONS DO
+-----------------------------------------------------------+
| |
| Input --> Model --> Prediction |
| | |
| v compare |
| Target |
| | |
| v |
| Loss <-- single number |
| | (lower = better) |
| v |
| Gradient <-- direction to improve |
| |
+-----------------------------------------------------------+
Loss = how wrong the model is. Training = minimize loss.
Cross-Entropy Loss
Use for: Classification (the standard choice)
CROSS-ENTROPY FORMULA
+-----------------------------------------------------------+
| |
| CE = -SUM(y_true x log(y_pred)) |
| |
| For binary: |
| CE = -[y x log(p) + (1-y) x log(1-p)] |
| |
| Example (binary, true label = 1): |
| Model predicts 0.9 -> Loss = -log(0.9) = 0.105 (low) |
| Model predicts 0.1 -> Loss = -log(0.1) = 2.303 (high) |
| |
+-----------------------------------------------------------+
LOSS BY PREDICTION
+-----------------------------------------------------------+
| |
| Prediction True=1 Loss True=0 Loss |
| ---------------------------------------- |
| 0.99 0.01 4.61 |
| 0.90 0.11 2.30 |
| 0.50 0.69 0.69 |
| 0.10 2.30 0.11 |
| 0.01 4.61 0.01 |
| |
+-----------------------------------------------------------+
Intuition: Punishes confident wrong predictions severely.
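A minimal sketch of the binary formula in plain Python (the epsilon clamp is a standard safeguard, not part of the formula); it reproduces the table above:

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """y: true label (0 or 1); p: predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)   # clamp so log() never sees 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Reproduces the table: confident-and-wrong predictions are punished hardest.
for p in [0.99, 0.90, 0.50, 0.10, 0.01]:
    print(f"p={p:.2f}  true=1: {binary_cross_entropy(p, 1):.2f}  "
          f"true=0: {binary_cross_entropy(p, 0):.2f}")
```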
Why Cross-Entropy (Not MSE) for Classification?
Reason 1: It’s Maximum Likelihood
MAXIMUM LIKELIHOOD
+-----------------------------------------------------------+
| |
| Goal: Find model that maximizes P(data | parameters) |
| |
| For classification: |
| P(correct labels) = PRODUCT(P(true_class_i)) |
| |
| Log likelihood (easier to work with): |
| log P = SUM(log P(true_class_i)) |
| |
| Minimizing negative log likelihood: |
| -log P = -SUM(log P(true_class_i)) |
| |
| This IS cross-entropy loss. |
| |
+-----------------------------------------------------------+
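To make the derivation concrete, a tiny numeric check with made-up probabilities, showing that the negative log-likelihood of a batch is exactly the summed per-example cross-entropy:

```python
import math

p_true = [0.9, 0.7, 0.4]             # model's probability for each example's true class

likelihood = math.prod(p_true)        # P(correct labels) = PRODUCT(P(true_class_i))
nll = -math.log(likelihood)           # negative log likelihood
ce_sum = sum(-math.log(p) for p in p_true)   # summed per-example cross-entropy

print(nll, ce_sum)                    # both ~1.378 -- the same quantity
```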
Reason 2: Better Gradients
MSE VS CROSS-ENTROPY GRADIENTS
+-----------------------------------------------------------+
| |
| True label: 1, Prediction: 0.01 (confident and WRONG) |
| |
| MSE gradient: |
| d/dp (p - 1)^2 = 2(p - 1) = 2(0.01 - 1) = -1.98 |
| |
| Cross-entropy gradient: |
| d/dp -log(p) = -1/p = -1/0.01 = -100 |
| |
| Cross-entropy: 50x larger gradient when confidently |
| wrong! Model learns faster from its worst mistakes. |
| |
+-----------------------------------------------------------+
The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
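A quick finite-difference check of the two gradients above, in plain Python with the same numbers:

```python
import math

def grad(f, p, h=1e-6):
    """Central-difference estimate of df/dp."""
    return (f(p + h) - f(p - h)) / (2 * h)

mse = lambda p: (p - 1) ** 2     # MSE loss, true label = 1
ce  = lambda p: -math.log(p)     # cross-entropy loss, true label = 1

p = 0.01                         # confident and wrong
print(grad(mse, p))              # ~ -1.98
print(grad(ce, p))               # ~ -100.0  (about 50x larger)
```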
Perplexity
Use for: Evaluating language models
PERPLEXITY
+-----------------------------------------------------------+
| |
| Perplexity = exp(cross-entropy) |
| |
| Or equivalently: |
| PPL = exp((-1/N) x SUM(log P(token_i))) |
| |
| Where N = number of tokens |
| |
+-----------------------------------------------------------+
PERPLEXITY EXAMPLES
+-----------------------------------------------------------+
| |
| PPL = 1 Model is certain (perfect prediction) |
| PPL = 10 Model is choosing among ~10 equally likely tokens |
| PPL = 100 Model is very uncertain |
| |
| Typical values: |
| GPT-2 on WikiText-103: ~20-30 PPL |
| GPT-3 175B: ~10-15 PPL |
| Fine-tuned on domain data: often < 10 PPL |
| |
+-----------------------------------------------------------+
Interpretation: “How many choices is the model confused between?”
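A minimal sketch of the formula, using made-up per-token log-probabilities:

```python
import math

token_log_probs = [-1.2, -0.3, -2.0, -0.7]   # log P(token_i) from the model

avg_nll = -sum(token_log_probs) / len(token_log_probs)   # per-token cross-entropy
ppl = math.exp(avg_nll)                                   # PPL = exp(cross-entropy)

print(ppl)   # ~2.86 -- "confused between ~3 choices" per token
```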
Mean Squared Error (MSE)
Use for: Regression (predicting continuous values)
MSE
+-----------------------------------------------------------+
| |
| MSE = (1/n) x SUM((y_true - y_pred)^2) |
| |
| Example: |
| True: [3, 5, 7] |
| Pred: [2.5, 5.2, 6.8] |
| |
| MSE = [(0.5)^2 + (0.2)^2 + (0.2)^2] / 3 |
| = [0.25 + 0.04 + 0.04] / 3 |
| = 0.11 |
| |
+-----------------------------------------------------------+
Intuition: Squared term punishes large errors more than small ones.
Variant — MAE (Mean Absolute Error):
- More robust to outliers than MSE
- Use when you have outliers that shouldn't dominate training (both metrics are sketched below)
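A plain-Python sketch of MSE and MAE on the example above:

```python
y_true = [3, 5, 7]
y_pred = [2.5, 5.2, 6.8]

errors = [t - p for t, p in zip(y_true, y_pred)]
mse = sum(e ** 2 for e in errors) / len(errors)   # squaring punishes large errors more
mae = sum(abs(e) for e in errors) / len(errors)   # linear -- outliers weigh less

print(round(mse, 2), round(mae, 2))   # 0.11  0.3
```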
Contrastive Loss
Use for: Embedding models, similarity learning
CONTRASTIVE LOSS
+-----------------------------------------------------------+
| |
| For positive pair (should be similar): |
| L = distance(a, b)^2 |
| |
| For negative pair (should be different): |
| L = max(0, margin - distance(a, b))^2 |
| |
| +--------------------------------------+ |
| | | |
| | anchor *-----* positive | |
| | ^ | |
| | minimize distance | |
| | | |
| | anchor * | |
| | v | |
| | maximize distance (up to | |
| | margin) | |
| | v | |
| | * negative | |
| | | |
| +--------------------------------------+ |
| |
+-----------------------------------------------------------+
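A minimal sketch of the pairwise contrastive loss above, using Euclidean distance on toy 2-D embeddings; the margin of 1.0 and the example vectors are illustrative:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, is_positive, margin=1.0):
    d = euclidean(a, b)
    if is_positive:
        return d ** 2                      # pull similar pairs together
    return max(0.0, margin - d) ** 2       # push dissimilar pairs apart, up to the margin

anchor   = [0.1, 0.9]
positive = [0.2, 0.8]    # should end up close to the anchor
negative = [0.9, 0.1]    # should end up at least `margin` away

print(contrastive_loss(anchor, positive, True))    # ~0.02 -- small, already close
print(contrastive_loss(anchor, negative, False))   # 0.0   -- already beyond the margin
```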
Variations:
- Triplet loss: anchor, positive, negative together
- InfoNCE: used in CLIP, SimCLR — treat other batch items as negatives
- Multiple negatives ranking: efficient batch training for embeddings
Choosing Loss Functions
| Task | Loss | Why |
|---|---|---|
| Binary classification | Binary Cross-Entropy | Standard, good gradients |
| Multi-class classification | Categorical Cross-Entropy | Generalizes BCE to N classes |
| Regression | MSE or MAE | MSE for normal errors, MAE for outliers |
| Embedding/similarity | Contrastive/Triplet | Learns relative distances |
| Language modeling | Cross-Entropy (per token) | Predict next token distribution |
| Ranking | Pairwise/Listwise losses | Optimize ordering |
Debugging Loss
LOSS NOT DECREASING
+-----------------------------------------------------------+
| |
| Symptoms: |
| - Loss stays flat from the start |
| - Loss decreases then plateaus early |
| |
| Causes: |
| - Learning rate too low -> increase 10x |
| - Data issue (bad labels, wrong preprocessing) |
| - Wrong loss function for task |
| - Model too small for task |
| - Bug in data pipeline (same batch repeated) |
| |
| Debug steps: |
| 1. Overfit to single batch first (should reach ~0) |
| 2. Check a few examples manually |
| 3. Verify labels are correct |
| 4. Try larger learning rate |
| |
+-----------------------------------------------------------+
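A hedged PyTorch sketch of debug step 1: train repeatedly on one fixed batch. If the loss does not approach zero, suspect the data, labels, or loss setup rather than model capacity. The toy model, data, and hyperparameters below are placeholders for your own.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 10)                 # one fixed batch
y = torch.randint(0, 3, (16,))

for step in range(500):                 # train on the SAME batch repeatedly
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())   # should be near 0; if not, check data, labels, and the loss setup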
LOSS GOES TO NaN
+-----------------------------------------------------------+
| |
| Symptoms: |
| - Loss suddenly becomes NaN or Inf |
| - Gradients explode |
| |
| Causes: |
| - Learning rate too high |
| - Numerical instability (log(0), division by 0) |
| - Missing gradient clipping |
| - Bad initialization |
| |
| Debug steps: |
| 1. Reduce learning rate by 10x |
| 2. Add gradient clipping (max_grad_norm=1.0) |
| 3. Check for log(0) -- add epsilon |
| 4. Use mixed precision carefully |
| |
+-----------------------------------------------------------+
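A sketch of debug steps 2 and 3 in PyTorch: clip_grad_norm_ caps exploding gradients, and a clamp before log() avoids log(0). The toy model and epsilon value are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
logits = model(x)

# Step 2: clip gradients so one bad batch can't blow up the weights.
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Step 3: if you take log() yourself, clamp first so log(0) never happens.
eps = 1e-8
probs = torch.softmax(logits.detach(), dim=-1)
manual_ce = -torch.log(probs.clamp(min=eps))[torch.arange(len(y)), y].mean()
print(manual_ce)
```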
Common Gotchas
1. Label smoothing
Instead of: [0, 0, 1, 0]
Use: [0.025, 0.025, 0.925, 0.025]
Prevents overconfidence, improves generalization.
Typical smoothing: 0.1 (10% of the probability mass spread uniformly across the classes, as above)
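A sketch using PyTorch's built-in label_smoothing argument (available on CrossEntropyLoss since 1.10), plus the manual smoothed target from the example; the logits and targets are made up:

```python
import torch
import torch.nn as nn

# Built-in: spreads `label_smoothing` of the probability mass uniformly over all classes.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
logits = torch.randn(2, 4)
targets = torch.tensor([2, 0])
print(loss_fn(logits, targets))

# Manual version of the smoothed target above (0.1 spread over 4 classes).
eps, num_classes = 0.1, 4
hard = torch.tensor([0., 0., 1., 0.])
smooth = hard * (1 - eps) + eps / num_classes
print(smooth)   # tensor([0.0250, 0.0250, 0.9250, 0.0250])
```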
2. Class imbalance
Weight rare classes higher:
CE_weighted = -SUM(weight_class x y x log(p))
Or use focal loss:
FL = -(1-p)^gamma x log(p)
Focuses on hard examples, down-weights easy ones.
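A sketch of both options in PyTorch: weighted cross-entropy via the weight argument of CrossEntropyLoss, and a minimal focal loss following the formula above. The class weights, gamma, and toy data are illustrative.

```python
import torch
import torch.nn as nn

# Weighted CE: make the rare class (index 1 here) count 5x more.
weighted_ce = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 5.0]))

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
print(weighted_ce(logits, targets))

# Minimal focal loss: down-weights easy examples, where p(true class) is already high.
def focal_loss(logits, targets, gamma=2.0):
    p = torch.softmax(logits, dim=-1)[torch.arange(len(targets)), targets]
    return (-(1 - p) ** gamma * torch.log(p)).mean()

print(focal_loss(logits, targets))
```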
When This Matters
| Situation | What to know |
|---|---|
| Training a classifier | Use cross-entropy, not MSE |
| Evaluating LLM quality | Report perplexity |
| Loss stuck high | Debug: overfit single batch first |
| Train/val loss diverging | Overfitting — add regularization |
| Loss goes NaN | Reduce LR, add gradient clipping |
| Imbalanced classes | Use weighted loss or focal loss |
| Fine-tuning embeddings | Contrastive loss variants |